I'd suggest you first make sure your XML is really UTF-8, using JHOVE:

   /path/to/jhove/jhove -c /path/to/jhove/conf/jhove.conf -m utf8-hul myFile.xml

If it fails, you could convert the file to UTF-8, on the (perhaps unwarranted) assumption that it is Windows Latin-1 (windows-1252):

   iconv -c -f windows-1252 -t UTF-8 myFile.xml > myFile.utf8.xml

Then, of course, test myFile.utf8.xml with jhove to see if it's valid.
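
If JHOVE is not at hand, iconv itself can act as a rough cross-check: converting a file from UTF-8 to UTF-8 fails on the first byte sequence that is not valid UTF-8. The sketch below wraps that in two helper functions (is_utf8 and to_utf8 are names made up here, not part of any tool). The fallback still assumes windows-1252, which may not be true of your data. Unlike the iconv command above, it omits -c, so bytes that cannot be converted raise an error rather than being dropped silently.

```shell
# is_utf8 FILE: exit status 0 only if FILE decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# to_utf8 IN OUT: copy IN to OUT if it is already UTF-8; otherwise convert it,
# ASSUMING (perhaps wrongly) that IN is windows-1252.
to_utf8() {
    if is_utf8 "$1"; then
        cp "$1" "$2"
    else
        iconv -f windows-1252 -t UTF-8 "$1" > "$2"
    fi
}

# Demo on a throwaway file: "Perez" with e-acute stored as the single
# Latin-1 byte 0xE9 (octal 351), which is not valid UTF-8 on its own.
tmp=$(mktemp -d)
printf 'P\351rez\n' > "$tmp/latin1.txt"
to_utf8 "$tmp/latin1.txt" "$tmp/utf8.txt"
is_utf8 "$tmp/utf8.txt" && echo "converted file is valid UTF-8"
```

For the real data this would be something like to_utf8 myFile.xml myFile.utf8.xml, followed by the JHOVE check.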

-Brian


On February 21, at 11:48 AM, Doran, Michael D wrote:

Hi Jackie,

I'm working on a very similar problem: converting theses/dissertations records (in XML) to MARC records. I'm still in the testing stage, but have had similar problems with records that have diacritics in the 100 or 245 fields (diacritics in a 520a field, however, don't seem to cause any problems). Since our records are not "diacritic rich", it's hard to determine the exact extent of the problem.

I am using these versions:
  Perl v5.8.8
  MARC::Charset 0.98
  MARC::Lint 1.43
  MARC::Record 2.0
  XML::LibXML 1.66

Here's an example "bad" record (which I have minimized to just the 245 field):

marcdump test.mrc
test.mrc
LDR 00127cam a2200037   4500
245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
       _cRiho Yoshioka.

 Recs  Errs Filename
----- ----- --------
    1     1 test.mrc

When I run test.mrc through MARC::Lint, I get this message:

Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
 Invalid length in directory for tag 245 in record 1
 field does not end in end of field character in tag 245 in record 1

When examined in vi, the character in question, a Registered Sign, appears to be correctly encoded in UTF-8 as C2 AE, and the bib Leader (position 09=a) indicates that the record is Unicode encoded. I've attached the MARC record.
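
The two numbers in the first Lint message are easy to inspect by hand: leader positions 00-04 hold the declared record length, and the file size gives the real one. A rough sketch for a file that holds exactly one record, as test.mrc does. The function name leader_vs_actual is made up for illustration, and the demo file is a dummy stand-in of the same sizes, not a real MARC record.

```shell
# leader_vs_actual FILE: print the record length declared in leader
# positions 00-04 next to the real byte size of FILE.
# ASSUMES FILE holds exactly one MARC record.
leader_vs_actual() {
    declared=$(head -c 5 "$1")
    actual=$(wc -c < "$1" | tr -d ' ')
    echo "declared=$declared actual=$actual"
}

# Dummy stand-in: a 5-digit "length" of 00127 followed by 120 filler
# bytes, so the file is really 125 bytes long.
sample=$(mktemp)
printf '00127%0120d' 0 > "$sample"
leader_vs_actual "$sample"
```

Run against the real file (leader_vs_actual test.mrc), this shows whether the mismatch is already present on disk or only appears when the record is read back.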

I noticed that when I run your record (ck245.dat) through MARC::Lint, I get the same invalid record length message:

Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
 field does not end in end of field character in tag 100 in record 3
 field does not end in end of field character in tag 245 in record 3
 Invalid indicators ".10" forced to blanks in record 3 for tag 245

 field does not end in end of field character in tag 260 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 260

 field does not end in end of field character in tag 300 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 300

 field does not end in end of field character in tag 502 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 502

 field does not end in end of field character in tag 504 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 504

 field does not end in end of field character in tag 690 in record 3
 Invalid indicators ". 4" forced to blanks in record 3 for tag 690

Anybody have any ideas?

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


-----Original Message-----
From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 19, 2008 10:50 AM
To: perl4lib@perl.org
Subject: Help for utf-8 output

I was wondering whether anyone has had a similar experience and has
come up with a good solution to the problem below.

What I have is an Excel spreadsheet of dissertations, which I have
saved as a tab-delimited file (examined in TextPad, the diacritics
appear to be fine). I then read the file in and output it as a UTF-8
MARC file. When I print the title field of a record whose author
field contains diacritics, the title shows the proper indicator
values.

But when I look at the MARC file itself, the fields that follow a
field containing diacritics are all shifted from their original
positions. See the attached zip file. Examples below (the first two
have diacritics in the 100 field; in the last one the diacritic is in
245 subfield b):

001     diss 34001
100 1  _aP<E9>rez, Nancy L.
245     _aSynchronic and Diachronic Matlatzinkan Phonology.

001     diss 34042
100 1  _aValent<ED>n-M<E1>rquez, Wilfredo
245     _aDoing being boricua :

001     diss 33892
100 1   _aDavis, Jennifer M.
245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I Mutations :
       _bIdentification of Ca<B2>+ Independent Contractile Dysfunction.

I would greatly appreciate any suggestions for solving this.
Thank you most kindly.

Regards,

--Jackie

|Jackie Shieh
|Data Loads & Development
|Harlan Hatcher Graduate Library
|University of Michigan
|920 North University
|Ann Arbor, MI 48109-1205
|Phone: 734.763.6070 FAX: 734.615.9788
|E-mail: JShieh [AT] umich [DOT] edu

<test.mrc>

--------------------------------------------------
Brian Sheppard
University of Wisconsin Digital Collections Center
[EMAIL PROTECTED]    (608) 262-3349


