Re: MARC::Record and UTF-8

Ed Summers Fri, 07 Jan 2005 06:54:47 -0800

On Fri, Jan 07, 2005 at 08:53:40AM +0100, Ron Davies wrote:
> I will have a similar project in a few months' time, converting a whole 
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist 
> in testing or development of a UTF-8 capability for MARC::Record. Is the 
> problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707) 
> the only known issue?


Correct. A few months ago I hacked at MARC::Record to try to get it to
use utf8 on platforms running perl >= 5.8.

I backed out these changes because my initial implementation proved to
be faulty. Essentially I treated all data as utf8 if perl was >= 5.8,
but this didn't work out since some valid MARC-8 data is invalid
UTF-8. I was bummed.

The problem (as Ron correctly points out) is that the Perl function length()
is being used to construct the byte offsets in the record directory. This
works fine when every character is one byte, but breaks badly on utf8 data,
since length() counts characters and a character can take more than one byte.

Fortunately there is the bytes pragma, introduced in 5.6, which provides
a bytes::length() function that returns the length in bytes rather than
characters. I believe bytes::length() itself was added on later,
somewhere in 5.8.
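
To make the difference concrete, here is a minimal sketch of how length()
and bytes::length() disagree on a string that contains multi-byte
characters. The sample string and variable names are made up for
illustration; this is not code from MARC::Record.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without turning on byte semantics

# Made-up field data: "Dvorak" with r-caron (U+0159) and a-acute (U+00E1).
# Each of those two characters takes two bytes when stored as UTF-8.
my $field = "Dvo\x{159}\x{e1}k";

my $chars = length($field);           # counts characters
my $bytes = bytes::length($field);    # counts bytes in the UTF-8 form

printf "characters: %d, bytes: %d\n", $chars, $bytes;
```

A directory built from the character count would be two bytes short for
this field, and every offset after it would be shifted by the same amount.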

I wanted MARC::Record to do the right thing based on position 9 in the
leader (the character coding scheme). But I don't know if this is
feasible. Perhaps simply having a flag when you create the MARC::Record,
MARC::Batch or MARC::File::USMARC objects will be enough:

    my $batch = MARC::Batch->new( 'USMARC', 'file.dat', utf8 => 1 );

or

    my $record = MARC::Record->new( utf8 => 1 );
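
For comparison, the leader-based approach might look roughly like the
sketch below. In MARC 21, leader position 9 is 'a' for UCS/Unicode and
blank for MARC-8. The helper name leader_says_utf8() and both sample
leaders are hypothetical, and nothing here is part of MARC::Record.

```perl
use strict;
use warnings;

# Hypothetical helper (not in MARC::Record): report whether a leader
# declares Unicode, based on the character at position 9 (zero-based).
sub leader_says_utf8 {
    my ($leader) = @_;
    return 0 unless defined $leader && length($leader) >= 10;
    return substr( $leader, 9, 1 ) eq 'a' ? 1 : 0;
}

# Made-up 24-character leaders that differ only at position 9.
my $unicode_leader = '00714cam a2200205 a 4500';
my $marc8_leader   = '00714cam  2200205 a 4500';

for my $leader ( $unicode_leader, $marc8_leader ) {
    printf "%s => %s\n", $leader,
        leader_says_utf8($leader) ? 'utf8' : 'marc8';
}
```

A check like this is only as good as the leader itself, so an explicit
utf8 flag would still be useful as an override for records whose leader
is wrong.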

Comments, thoughts, hacks welcome :-) This shouldn't be too tough; it
just needs some concentrated attention.

//Ed
