On 3/7/07, Bryan Baldus <[EMAIL PROTECTED]> wrote:
On Wednesday, March 07, 2007 2:34 PM, Ron Davies wrote:
>When I do this I get a number of error messages such as :
>"\x{00ce}" does not map to utf8 at myprogram.pl line xxx.
>and in the output file instead of the correct character there is a hex
>encoding. This happens with Greek but also perfectly ordinary Latin
>characters.

I can't offer any advice, but I am experiencing what may be similar difficulties. I finally had a chance to get MARC::Charset and MARC::File::XML installed and working, so I could try out xml2marc and marc2xml. After creating a test record containing a field with diacritics, I tried using marc2xml followed by xml2marc, hoping to end up with records matching the original. marc2xml appears to have successfully translated the raw MARC into MARCXML (it left the leader unchanged--no update to the record length, though it did set byte 9 to 'a' for Unicode). Unfortunately, attempting to use xml2marc on any of the .xml files I have results in an empty file. In some cases I get a message:
Two things here: 1) there will be a new version of MARC::Charset out soon-ish which is more forgiving and has mechanisms for dealing with random (identifiable) encodings, and 2) I'm not sure that the leader's record-length field means anything in the context of MARCXML ... but if anyone can think of some semantics for that I'll gladly implement it.
"Cannot decode string with wide characters at C:/Perl/lib/Encode.pm line 184, <GEN1> line 1."

In other cases, I get no error messages, but still have an empty file. I have tried a number of variations in the starting file: marc8.mrc->utf8.xml; utf8.mrc->utf8.xml; MarcEdit-produced .xml->Perl-produced .mrc.

My system:
Windows XP; ActivePerl v5.8.2 built for MSWin32-x86-multi-thread (Binary build 808)
MARC::Record: 2.0
Encode: 1.9801

Are these problems related to the age of my Perl or Encode?
This is almost certainly related to the issue that Josh has seen with, um, sub-par SAX parsers. He may be able to shed a little more light on that, as I use the LibXML parser exclusively (and I've never had issues getting utf-8 out...). Josh? (he's currently on a plane, so it may be tomorrow...)
(If I remember correctly, before switching to MARC::Record 2.0, using MARC::Record 1.39_1 and xml2marc resulted in records being output but the field containing diacritics was mangled/deleted/replaced with bad data.)

Thank you for your assistance,

Bryan Baldus
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://home.inwave.com/eija
--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org
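
For readers trying to reproduce the "Cannot decode string with wide characters" message quoted in this thread: one common way to trigger it with Perl's Encode module is to call decode() on a string that has already been decoded into characters. The sketch below is only an illustration of that general failure mode, not a diagnosis of what xml2marc or any particular SAX parser does. The helper name decode_error is hypothetical and is not part of Encode, MARC::Record, or MARC::File::XML.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper for illustration only: try to decode $text as UTF-8
# and return the error message, or an empty string if decoding succeeded.
sub decode_error {
    my ($text) = @_;
    my $copy = $text;    # work on a copy so the caller's string is untouched
    return eval { decode('UTF-8', $copy); 1 } ? '' : $@;
}

my $bytes = "\xCE\xB1";               # raw UTF-8 bytes for Greek small alpha
my $chars = decode('UTF-8', $bytes);  # character string holding U+03B1

# Decoding raw bytes is fine; decoding the already-decoded string is not.
print "decoding bytes:      '", decode_error($bytes), "'\n";
print "decoding characters: '", decode_error($chars), "'\n";
```

If a parser hands back text that is already a character string and a later layer decodes it a second time, this is the kind of error that can surface, which may be why the choice of SAX parser matters here.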
