I'd suggest you first make sure your XML is really UTF-8, using JHOVE:
/path/to/jhove/jhove -c /path/to/jhove/conf/jhove.conf -m utf8-hul myFile.xml
If it fails, you could convert it to UTF-8 on the (perhaps unwarranted)
assumption that it's Windows Latin-1 (windows-1252):
iconv -c -f windows-1252 -t UTF-8 myFile.xml > myFile.utf8.xml
Then, of course, test myFile.utf8.xml with JHOVE to confirm it's valid UTF-8.
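If JHOVE or iconv isn't handy, the same check-then-convert idea can be sketched in a few lines of Python. This is only an illustration, not a replacement for the tools above: `to_utf8` is a hypothetical helper name, the windows-1252 fallback is the same unverified assumption as before, and bytes that windows-1252 leaves undefined are replaced with U+FFFD here rather than dropped as `iconv -c` would do. The sample bytes are the "P<E9>rez" author name from Jackie's record.

```python
def to_utf8(raw, fallback="cp1252"):
    """Return (utf8_bytes, was_already_utf8) for a byte string.

    If `raw` decodes cleanly as UTF-8 it is returned unchanged.
    Otherwise it is decoded with `fallback` (assumed, not detected)
    and re-encoded as UTF-8.
    """
    try:
        raw.decode("utf-8")
        return raw, True
    except UnicodeDecodeError:
        text = raw.decode(fallback, errors="replace")
        return text.encode("utf-8"), False


# A lone 0xE9 byte is not valid UTF-8, so the fallback path is taken.
converted, was_utf8 = to_utf8(b"P\xe9rez, Nancy L.")
print(was_utf8, converted)
```

Note that passing the UTF-8 test does not prove the text was meant as UTF-8; pure ASCII passes trivially, so it is still worth eyeballing a record that you know contains diacritics.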
-Brian
On February 21, at 11:48 AM, Doran, Michael D wrote:
Hi Jackie,
I'm working on a very similar problem: converting thesis/dissertation
records (in XML) to MARC records. I'm still in the testing stage, but
have had similar problems with records that have diacritics in the 100
or 245 fields (diacritics in a 520a field, however, don't seem to cause
any problems). Since our records are not "diacritic rich", it's hard to
determine the exact extent of the problem.
I am using these versions:
Perl v5.8.8
MARC::Charset 0.98
MARC::Lint 1.43
MARC::Record 2.0
XML::LibXML 1.66
Here's an example "bad" record (which I have minimized to just the
245 field):
marcdump test.mrc
test.mrc
LDR 00127cam a2200037   4500
245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
       _cRiho Yoshioka.
Recs Errs Filename
----- ----- --------
1 1 test.mrc
When I run test.mrc through MARC::Lint, I get this message:
Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
Invalid length in directory for tag 245 in record 1
field does not end in end of field character in tag 245 in record 1
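The length and end-of-field checks behind messages like these can be approximated by hand. The sketch below is Python, not Perl, and does not use MARC::Record or MARC::Lint; `build_record` and `check_record` are hypothetical helper names. It assumes the usual ISO 2709 layout (24-byte leader with the record length in positions 00-04 and the base address in 12-16, 12-byte directory entries, 0x1E field terminator, 0x1D record terminator) and assumes the wrapped title above contains single spaces. It rebuilds the one-field record with every length counted in bytes, then checks a copy whose leader claims 00127.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator


def build_record(fields):
    """Build a minimal ISO 2709 record from (tag, data_bytes) pairs.

    All lengths and offsets are counted in bytes, not characters.
    """
    directory = b""
    body = b""
    for tag, data in fields:
        field = data + FT
        directory += f"{tag}{len(field):04d}{len(body):05d}".encode("ascii")
        body += field
    base = 24 + len(directory) + 1          # leader + directory + its terminator
    total = base + len(body) + 1            # plus the record terminator
    leader = f"{total:05d}cam a22{base:05d}   4500".encode("ascii")
    return leader + directory + FT + body + RT


def check_record(raw):
    """Return a list of structural problems found in one raw record."""
    problems = []
    declared = int(raw[0:5])
    if declared != len(raw):
        problems.append(f"leader says {declared} bytes but record is {len(raw)}")
    base = int(raw[12:17])
    directory = raw[24:base - 1]
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        field = raw[base + start:base + start + length]
        if not field.endswith(FT):
            problems.append(f"field {tag} does not end in a field terminator")
    return problems


# Indicators "13", then subfields a and c, encoded as UTF-8 bytes.
field_245 = ("13\x1faAn Empirical Test Of The Situational Leadership\u00ae "
             "Model In Japan /\x1fcRiho Yoshioka.").encode("utf-8")
good = build_record([("245", field_245)])
bad = b"00127" + good[5:]                   # same bytes, leader over-counts by 2

print(len(good), check_record(good))
print(check_record(bad))
```

Under those assumptions the byte-counted record comes out at 125 bytes, which suggests (without proving) that the data on disk is fine and that it is the declared lengths that are off.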
When I examine the record in vi, the character in question, a
Registered Sign (U+00AE), appears to be correctly encoded in UTF-8 as
C2 AE, and the bib Leader (position 09 = a) indicates that the record
is Unicode encoded. I've attached the MARC record.
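For reference, here is how the Registered Sign comes out under three encoding paths, as a small Python sketch (`encodings_of` is a hypothetical helper name). The third path, where UTF-8 bytes are misread as Latin-1 and encoded a second time, is included only as one possible, unconfirmed way for a length to be counted two bytes too high; nothing in this thread establishes that this is what happens in the Perl code.

```python
def encodings_of(text):
    """Return the byte strings produced by three encoding paths for `text`."""
    utf8 = text.encode("utf-8")
    return {
        "utf-8": utf8,
        "latin-1": text.encode("latin-1"),
        # UTF-8 bytes misread as Latin-1 and encoded again ("double encoding")
        "double": utf8.decode("latin-1").encode("utf-8"),
    }


# U+00AE REGISTERED SIGN: one character, but a different byte count per path.
for name, data in encodings_of("\u00ae").items():
    print(name, data.hex(), len(data))
```

The character is one unit long as text, two bytes in UTF-8 (c2ae, matching what vi shows), one byte in Latin-1, and four bytes if double-encoded, so any length computed on one form and written out in another will disagree with the bytes on disk.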
I noticed that when I run your record (ck245.dat) through
MARC::Lint, I get the same invalid record length message:
Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
field does not end in end of field character in tag 100 in record 3
field does not end in end of field character in tag 245 in record 3
Invalid indicators ".10" forced to blanks in record 3 for tag 245
field does not end in end of field character in tag 260 in record 3
Invalid indicators ". " forced to blanks in record 3 for tag 260
field does not end in end of field character in tag 300 in record 3
Invalid indicators ". " forced to blanks in record 3 for tag 300
field does not end in end of field character in tag 502 in record 3
Invalid indicators ". " forced to blanks in record 3 for tag 502
field does not end in end of field character in tag 504 in record 3
Invalid indicators ". " forced to blanks in record 3 for tag 504
field does not end in end of field character in tag 690 in record 3
Invalid indicators ". 4" forced to blanks in record 3 for tag 690
Anybody have any ideas?
-- Michael
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
-----Original Message-----
From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 19, 2008 10:50 AM
To: perl4lib@perl.org
Subject: Help for utf-8 output
I was wondering whether anyone has had a similar experience and come
up with a good solution to the problem below.
What I have is an Excel spreadsheet of dissertations that I saved as
a tab-delimited file (when I examine the file in TextPad, the
diacritics appear to be fine). I then read the file in and write it
out as a UTF-8 MARC file. I <print> the title field to confirm that,
for records whose author field contains diacritics, the title shows
the proper indicator values.
But when I looked at the MARC itself, the fields that follow a
field containing diacritics are all shifted from their original
positions. See the attached zip file. Examples below (the first two
have diacritics in a 100 field; in the last one the diacritic is in
245 subfield b):
001 diss 34001
100 1 _aP<E9>rez, Nancy L.
245 _aSynchronic and Diachronic Matlatzinkan Phonology.
001 diss 34042
100 1 _aValent<ED>n-M<E1>rquez, Wilfredo
245 _aDoing being boricua :
001 diss 33892
100 1 _aDavis, Jennifer M.
245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I Mutations :
       _bIdentification of Ca<B2>+ Independent Contractile Dysfunction.
I would greatly appreciate any suggestions for solving this.
Thank you most kindly.
Regards,
--Jackie
|Jackie Shieh
|Data Loads & Development
|Harlan Hatcher Graduate Library
|University of Michigan
|920 North University
|Ann Arbor, MI 48109-1205
|Phone: 734.763.6070 FAX: 734.615.9788
|E-mail: JShieh [AT] umich [DOT] edu
<test.mrc>
--------------------------------------------------
Brian Sheppard
University of Wisconsin Digital Collections Center
[EMAIL PROTECTED] (608) 262-3349