On Fri, 07 Jan 2005 08:53:40 +0100, Ron Davies <[EMAIL PROTECTED]> wrote:
> At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
> >Does anyone know of any work underway to adapt MARC::Record for utf-8
> >encoding ?

I'm in the process of updating MARC::File::XML to support Unicode.  I
was hoping to have the changes in CVS about a month ago, but I've had
no time until now.

Once that is done I'll look into what it will take to do the same for
MARC::File::USMARC.  If you'd like to look into it, you'll be able to
grab an updated MARC::File::XML from SourceForge's CVS some time this
afternoon.  I'll announce it here when I get CVS updated, and post a
link to the anonymous CVS instructions from the project page.

> 
> I will have a similar project in a few months' time, converting a whole
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist
> in testing or development of a UTF-8 capability for MARC::Record. Is the
> problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707)
> the only known issue?

The way I am getting around issues like this in MARC::File::XML is to
strip the utf8 flag off the data using Encode::encode(), which gives
me the raw bytes of the string.  With the flag off, length() works
correctly, outputting to a file does not complain about wide
characters, and C-based XML libraries (libxml2 in my case) see the
correct data.  The only drawback is that you cannot use any
Unicode-aware Perl functions on the strings; everything is treated as
8-bit extended ASCII (or Latin-1, or whatever non-Unicode codepage
your locale is set up for).  I can't see why that is actually a
problem, other than for locale-specific sorting, which is not an issue
for XML since it is only used as an input/output format; other
software, usually written in C, handles actually manipulating the
data.
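To make that concrete, here is a minimal sketch of the technique (not
taken from the MARC::File::XML source; the string and variable names
are just for illustration):

```perl
use strict;
use warnings;
use Encode;

# A string containing a character above U+00FF, so Perl stores it
# with the internal utf8 flag set.
my $chars = "MARC\x{263A}";                  # "MARC" plus U+263A

# Encode::encode() returns the UTF-8 octets as a plain byte string
# with the utf8 flag off -- safe to hand to a filehandle or a
# C library such as libxml2.
my $bytes = Encode::encode('UTF-8', $chars);

print length($chars), "\n";   # 5 -- counted in characters
print length($bytes), "\n";   # 7 -- counted in bytes (U+263A is 3)
```

The byte-oriented length() result is exactly what you want when
computing the record length and offsets for a MARC leader and
directory, which is where the character-counting behavior bites.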

...  Not that that applies *directly* to your question ... :)

-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org