RE: yet another character encoding question

Doran, Michael D Thu, 29 Sep 2005 07:44:10 -0700

Hi Jason,
 
I believe that MARC::Charset only converts between MARC-8 and UTF-8 (in either 
direction), so on its own it won't automate your Latin-1 to MARC-8 conversion 
unless you chain the steps: Latin-1 => UTF-8 => MARC-8.
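
As a rough illustration of the first hop in that chain (Latin-1 => UTF-8), here is a minimal sketch. It is written in Python only to keep it short and testable; the function name latin1_to_utf8 is made up for this example, and the second hop (UTF-8 => MARC-8) is left to MARC::Charset, whose interface is not shown here.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO 8859-1 (Latin-1) bytes as UTF-8."""
    # Every byte value 0x00-0xFF is a valid Latin-1 character,
    # so the decode step cannot fail on arbitrary input.
    return raw.decode("latin-1").encode("utf-8")


sample = b"Espa\xf1a"              # "España" as stored in a Latin-1 column
print(latin1_to_utf8(sample))      # b'Espa\xc3\xb1a'
```

The result could then be handed to whatever UTF-8 to MARC-8 converter is available.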
 
A few years ago, I wrote an imperfect MARC-8 to Latin-1 character set 
conversion routine [1].  If you can't find any off-the-shelf solution, it may 
serve as a basis for writing a Latin-1 to MARC-8 conversion routine.  Because 
MARC-8 is only really used in "library land" and is somewhat complex, I found 
few available open-source conversion routines (this was before Ed Summers wrote 
MARC::Charset), which is why I wrote my own.
 
> During the test install, it says it requires the module DB_File,
> and during the test install of that, it fails 
 
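I believe that Berkeley DB is a prerequisite: DB_File is the Perl interface to 
the Berkeley DB library, so that library needs to be installed on the machine 
before DB_File will build.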
 
-- Michael Doran
 
[1] MARC to Latin: A charset conversion routine in Perl
    http://rocky.uta.edu/doran/charset/

________________________________

From: Thomale, J [mailto:[EMAIL PROTECTED]
Sent: Thu 9/29/2005 8:59 AM
To: perl4lib@perl.org
Subject: yet another character encoding question

Hello all,

I'm brand new to this list, and I need some help with a particular
issue. I searched through the mailing list archives but didn't find
anything directly addressing this--despite the seeming popularity of
questions about character sets--so I thought I'd ask.

I've written a perl script that extracts data from a MySQL database,
uses MARC::Record to map that data to MARC, and outputs the MARC record
(based on a script written by Brian Surratt of Texas A&M University).
The resulting records need to have all data encoded in MARC-8 format
(for loading into OCLC and into our local catalog). The data in the
MySQL database is encoded using ISO 8859-1 (latin-1). The MARC records
output by the script work fine so long as they don't contain diacritics
(or other weird stuff). When they do contain diacritics, those
diacritics come out incorrectly when the MARC record is read by a
program expecting MARC-8 (because the diacritics are encoded in
latin-1).

So, is there an easy way to translate from latin-1 encoding to
MARC-8/ANSEL? I've been unable to find any perl modules that help me
with this outside of MARC::Charset. Unfortunately, we're having trouble
getting that module installed on our machine. During the test install,
it says it requires the module DB_File, and during the test install of
that, it fails (not sure what the error message is--I'd have to ask the
admin of that machine). We're running Perl v5.8.3.

FWIW, I did try manually searching/replacing diacritics in the extracted
database fields before converting to MARC and it worked fine (I tried it
on a record that contained Spanish, so there were limited characters
that applied). In order for this approach to be viable, I'd have to map
ALL the latin-1 characters to their MARC-8 counterparts, which would be
a time-consuming process.
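
One way to cut down the manual mapping work, sketched below under stated assumptions: most accented Latin-1 letters decompose (Unicode NFD) into a base ASCII letter plus one combining mark, and MARC-8/ANSEL writes the diacritic byte before the base letter. The table is deliberately partial, the byte values are quoted from memory and should be checked against the Library of Congress MARC-8 code tables, and the names COMBINING_TO_MARC8, SPECIALS, and latin1_to_marc8 are hypothetical. Python is used only for brevity; the same logic ports to Perl.

```python
import unicodedata

# ASSUMPTION: ANSEL byte values for combining diacritics (verify against
# the LoC MARC-8 code tables before relying on them).
COMBINING_TO_MARC8 = {
    "\u0300": 0xE1,  # grave
    "\u0301": 0xE2,  # acute
    "\u0302": 0xE3,  # circumflex
    "\u0303": 0xE4,  # tilde
    "\u0308": 0xE8,  # diaeresis / umlaut
    "\u0327": 0xF0,  # cedilla
}

# ASSUMPTION: a few non-decomposable Latin-1 characters used in Spanish.
SPECIALS = {
    "\u00bf": 0xC5,  # inverted question mark
    "\u00a1": 0xC6,  # inverted exclamation mark
}


def latin1_to_marc8(raw: bytes) -> bytes:
    """Convert Latin-1 bytes to MARC-8, raising on anything unmapped."""
    out = bytearray()
    for ch in raw.decode("latin-1"):
        if ord(ch) < 0x80:              # ASCII passes through unchanged
            out.append(ord(ch))
            continue
        if ch in SPECIALS:
            out.append(SPECIALS[ch])
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        if (not marks or ord(base) >= 0x80
                or any(m not in COMBINING_TO_MARC8 for m in marks)):
            # Fail loudly rather than emit bytes a MARC-8 reader would misread.
            raise ValueError(f"no MARC-8 mapping for U+{ord(ch):04X}")
        out.extend(COMBINING_TO_MARC8[m] for m in marks)  # diacritic first
        out.append(ord(base))                              # then base letter
    return bytes(out)


print(latin1_to_marc8(b"Espa\xf1a"))   # b'Espa\xe4na'
```

Raising on unmapped characters, rather than passing them through, means a record with an unexpected character stops the run instead of silently producing bad MARC-8.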

On top of this, there are a few records containing the characters hex EF
BF BD, which is the UTF-8 replacement character. I'm a bit mystified as
to where this is coming from, and it would be trivial enough to simply
strip it out, but this approach doesn't guarantee that the script will
catch all non-MARC-8 characters. That's why I'd really prefer to use
MARC::Charset for this--it needs to be robust enough that I won't have
to baby-sit it all the time.
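
A small sketch of how those records could be flagged rather than silently stripped. The bytes EF BF BD are the UTF-8 encoding of U+FFFD, which decoders typically substitute for input they cannot interpret, so their presence suggests the original character was already lost somewhere upstream (an assumption about this particular database, not something confirmed here). The names REPLACEMENT_UTF8 and find_replacement_chars are hypothetical, and Python is used only for brevity.

```python
REPLACEMENT_UTF8 = b"\xef\xbf\xbd"   # UTF-8 encoding of U+FFFD


def find_replacement_chars(raw: bytes) -> list:
    """Return the byte offsets of every U+FFFD sequence in a field value."""
    positions = []
    start = raw.find(REPLACEMENT_UTF8)
    while start != -1:
        positions.append(start)
        # Resume the search just past the sequence that was found.
        start = raw.find(REPLACEMENT_UTF8, start + len(REPLACEMENT_UTF8))
    return positions


field = b"ab\xef\xbf\xbdcd\xef\xbf\xbd"
print(find_replacement_chars(field))   # [2, 7]
```

Logging the record identifier together with these offsets would make it possible to repair the source data by hand instead of dropping the character.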

So, I suppose my question is two-fold. 1. Has anyone had similar
problems getting MARC::Charset installed? Could you offer any advice
that I can pass along as to how to get it installed? 2. Are there any
other perl modules that will convert latin-1 to MARC-8/ANSEL?

Thanks in advance for any help you can offer.

Jason Thomale
Metadata Librarian
Texas Tech University Libraries
(806) 742-2240
