RE: Help for utf-8 output

Doran, Michael D Mon, 03 Mar 2008 08:35:51 -0800

Hi Leif,

I really appreciate you taking a look at this and responding.  Although I 
consider myself somewhat knowledgeable about character sets, I still find these 
kinds of problems to be confusing.


> In this case the leader and actual length will not agree, as 
> your utf8 characters have turned into latin1.

I was under the impression that the MARC record length in the Leader was the 
record length in bytes rather than the number of characters.  Is that your 
understanding?
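
A minimal sketch of the bytes-versus-characters distinction in question, assuming nothing about MARC::Record internals. The sample string is a fragment of the 245 title from the test record quoted below, and the variable names are made up for illustration:

```perl
#!perl -w
use strict;
use Encode qw(decode encode);

# The Registered Sign (U+00AE) as it sits in a UTF-8 MARC file: bytes C2 AE.
my $raw_bytes = "Leadership\xC2\xAE";          # byte string, utf8 flag off
my $chars     = decode('UTF-8', $raw_bytes);   # character string, utf8 flag on

my $byte_len      = length($raw_bytes);                # counts bytes
my $char_len      = length($chars);                    # counts characters
my $reencoded_len = length(encode('UTF-8', $chars));   # bytes again

print "bytes: $byte_len, characters: $char_len, re-encoded: $reencoded_len\n";
```

If the Leader length is meant to be a byte count, it has to be computed on the encoded byte string, since length() on the decoded string comes out one lower for every two-byte character.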

Also, I am still troubleshooting my particular set of records (I was out of 
town last week) since this problem only appears to manifest itself for records 
with non-ASCII characters in the 100 and 245 fields.  Records with non-ASCII 
characters in a note field don't cause a problem. 

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Leif Andersson [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, March 01, 2008 2:51 PM
> To: Doran, Michael D; perl4lib@perl.org; [EMAIL PROTECTED]
> Subject: Re: Help for utf-8 output
> 
> It seems there is a little bug (by design) kicking in.
> 
> The leader comes out wrong and some characters come out wrong in this case:
>    + Reading a raw marc record (utf8) from file
>    + Turning it into a MARC::Record object
>    + Without modification writing it out to file.
>      Yes. Even without modification the bug manifests itself!
> 
> Let's start with code simply copying one record from a file 
> utf8.mrc containing one or more marc records. This basic 
> operation not involving MARC::Record  is OK.
> 
> #!perl -w
> use strict;
> #
> open(IN, "utf8.mrc")  || die "1";
> open(OUT, ">out_good.mrc") || die "2";
> binmode IN;
> binmode OUT;
> #
> # Read in raw MARC
> $/ = "\x1D";
> my $marc = <IN>;
> print OUT $marc;
> __END__
> 
> Now, we're adding MARC::Record to the process, along with 
> some debug info.
> Example code producing *faulty* record:
> 
> #!perl -w
> use strict;
> use MARC::Record;
> use Devel::Peek;
> #
> open(IN, "utf8.mrc")  || die "1";
> open(OUT, ">out_bad.mrc") || die "2";
> binmode IN;
> binmode OUT;
> #
> # Read in raw MARC
> $/ = "\x1D";
> my $marc = <IN>;
> Dump($marc);  # the utf8-flag is not on
> my $obj  = MARC::Record->new_from_usmarc( $marc );
> # Convert back to raw MARC
> my $marc2 = $obj->as_usmarc();
> Dump($marc2);  # the utf8-flag IS on
> print OUT $marc2;
> __END__
> 
> 
> In this case the leader and actual length will not agree, as 
> your utf8 characters have turned into latin1.
> The problem is that $marc2 has the utf8 flag set internally by Perl.
> And the conversion on output is made in spite of binmode.
> 
> We can get around the problem by either (for instance) use bytes;
>   or
> Encode::_utf8_off($marc2);
> before printing to file.
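
The effect described above can be sketched without MARC::Record at all. This is only an illustration under stated assumptions: $flagged is a hand-built stand-in for $marc2 (a string with the utf8 flag on), written_bytes() is a hypothetical helper that prints to an in-memory handle instead of a file, and utf8::encode() is swapped in for Encode::_utf8_off() as a way to turn the string back into bytes before printing.

```perl
#!perl -w
use strict;
use Encode ();

# Stand-in for $marc2 (assumption): a character string with the utf8 flag on.
my $flagged = Encode::decode('UTF-8', "Leadership\xC2\xAE");

# Hypothetical helper: print a string to an in-memory handle in binmode
# and return the bytes that were actually written.
sub written_bytes {
    my ($string) = @_;
    my $buffer = '';
    open(my $out, '>', \$buffer) || die "cannot open in-memory handle";
    binmode $out;
    print $out $string;
    close $out;
    return $buffer;
}

# Printed as-is: the flagged string is downgraded to latin1 on output.
my $bad = written_bytes($flagged);

# Encoded to UTF-8 bytes first: the output keeps the two-byte sequence.
my $fixed = $flagged;
utf8::encode($fixed);
my $good = written_bytes($fixed);

printf "as-is: %d bytes, encoded first: %d bytes\n", length($bad), length($good);
```

Whether MARC::Record itself should do this before returning from as_usmarc() is the open question; the sketch only shows why the output length changes.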
> 
> But shouldn't MARC::Record take care of this for us?
> A file of MARC records may contain records in different encodings.
> The text parts of a MARC record can be treated as made up by 
> certain encodings, but the "blob" itself, I suppose, should 
> be exposed to the caller as pure binary.
> 
> Are there any drawbacks in letting MARC::Record strip off any 
> eventual utf8 flag before returning the record as_usmarc() ?
> If not I suggest this change be made to a future release of 
> MARC::Record.
> 
> I should also add that this character mess only sets in when doing IO.
> If you are updating your databases through one API or another 
> you are probably OK!
> 
> 
> Leif
> ======================================
> Leif Andersson, Systems Librarian
> Stockholm University Library
> SE-106 91 Stockholm
> SWEDEN
> Phone : +46 8 162769
> Mobile: +46 70 6904281
> 
> -----Original Message-----
> From: Doran, Michael D [mailto:[EMAIL PROTECTED]
> Sent: 21 February 2008 18:49
> To: perl4lib@perl.org
> Subject: RE: Help for utf-8 output
> 
> Hi Jackie,
> 
> I'm working on a very similar problem... converting 
> theses/dissertations records (in XML) to MARC records.  I'm 
> still in the testing stage, but have had similar problems 
> with records with diacritics in the 100 or 245 fields 
> (however diacritics in a 520a field don't seem to cause any 
> problems).  Since our records are not "diacritic rich" it's 
> hard to determine the exact extent of the problem.
> 
> I am using these versions:
>   Perl v5.8.8
>   MARC::Charset 0.98
>   MARC::Lint 1.43
>   MARC::Record 2.0
>   XML::LibXML 1.66
> 
> Here's an example "bad" record (which I have minimized to 
> just the 245 field):
> 
> marcdump test.mrc
> test.mrc
> LDR 00127cam a2200037   4500
> 245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
>        _cRiho Yoshioka.
> 
>  Recs  Errs Filename
> ----- ----- --------
>     1     1 test.mrc
> 
> When I run test.mrc through MARC::Lint, I get this message:
> 
>  Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
>  Invalid length in directory for tag 245 in record 1
>  field does not end in end of field character in tag 245 in record 1
> 
> When examined in vi the character in question, a Registered 
> Sign, appears to be correctly UTF-8 encoded C2AE, and the bib 
> Leader (position 09=a) indicates that it is Unicode encoded.  
> I've attached the MARC record.
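
The record-length complaint boils down to comparing Leader positions 00-04 with the number of bytes actually in the record. A rough sketch of that comparison follows; it is not MARC::Lint's code, record_lengths() is a hypothetical helper, and the two sample strings are synthetic padding rather than valid MARC records.

```perl
#!perl -w
use strict;

# Hypothetical helper: return the length declared in the Leader and the
# actual length of one raw record (a byte count only if $raw is a byte string).
sub record_lengths {
    my ($raw) = @_;
    my $declared = substr($raw, 0, 5) + 0;   # Leader positions 00-04
    my $actual   = length($raw);
    return ($declared, $actual);
}

# Synthetic data (not valid MARC): 5-digit length, 24 filler bytes, terminator.
my $consistent   = "00030" . ("x" x 24) . "\x1D";
my $inconsistent = "00032" . ("x" x 24) . "\x1D";

my ($d1, $a1) = record_lengths($consistent);
my ($d2, $a2) = record_lengths($inconsistent);

for my $pair ([$d1, $a1], [$d2, $a2]) {
    my ($declared, $actual) = @$pair;
    my $verdict = $declared == $actual ? "agree" : "disagree";
    print "Leader says $declared, actual $actual: $verdict\n";
}
```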
> 
> I noticed that when I run your record (ck245.dat) through 
> MARC::Lint, I get the same invalid record length message:
> 
>  Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
>  field does not end in end of field character in tag 100 in record 3
>  field does not end in end of field character in tag 245 in record 3
>  Invalid indicators ".10" forced to blanks in record 3 for tag 245
>  field does not end in end of field character in tag 260 in record 3
>  Invalid indicators ".  " forced to blanks in record 3 for tag 260
>  field does not end in end of field character in tag 300 in record 3
>  Invalid indicators ".  " forced to blanks in record 3 for tag 300
>  field does not end in end of field character in tag 502 in record 3
>  Invalid indicators ".  " forced to blanks in record 3 for tag 502
>  field does not end in end of field character in tag 504 in record 3
>  Invalid indicators ".  " forced to blanks in record 3 for tag 504
>  field does not end in end of field character in tag 690 in record 3
>  Invalid indicators ". 4" forced to blanks in record 3 for tag 690
> 
> Anybody have any ideas?
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>  
> 
> > -----Original Message-----
> > From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, February 19, 2008 10:50 AM
> > To: perl4lib@perl.org
> > Subject: Help for utf-8 output
> > 
> > I was wondering if anyone has had a similar experience and has come up 
> > with good solutions to the challenge below?
> > 
> > What I have is an Excel spreadsheet for dissertations which I have 
> > saved as a tab delimited file (examining the file in TextPad, the 
> > diacritics appear to be fine), then read in and output the file as a 
> > utf-8 MARC file. I  <print> title field confirming author field that 
> > contains diacritics with the title showing proper indicator values.
> > 
> > But when I looked at the MARC itself, the fields that follow the field 
> > containing diacritics are all off their original positions. See attached 
> > zip file.  Examples below (the first two have diacritics in a 100 field; 
> > in the last one the diacritic is in 245 subfield b):
> > 
> > 001     diss 34001
> > 100 1  _aP<E9>rez, Nancy L.
> > 245     _aSynchronic and Diachronic Matlatzinkan Phonology.
> > 
> > 001     diss 34042
> > 100 1  _aValent<ED>n-M<E1>rquez, Wilfredo
> > 245     _aDoing being boricua :
> > 
> > 001     diss 33892
> > 100 1   _aDavis, Jennifer M.
> > 245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I Mutations :
> >             _bIdentification of Ca<B2>+ Independent Contractile Dysfunction.
> > 
> > I would greatly appreciate any suggestions for solving this. 
> > Thank you most kindly. 
> > 
> > Regards,
> >  
> > --Jackie
> >  
> > |Jackie Shieh
> > |Data Loads & Development
> > |Harlan Hatcher Graduate Library
> > |University of Michigan
> > |920 North University
> > |Ann Arbor, MI 48109-1205
> > |Phone: 734.763.6070 FAX: 734.615.9788
> > |E-mail: JShieh [AT] umich [DOT] edu
> > 
>
