MARC::Record and UTF-8 Perl version

2005-03-08 Thread Ian . Hamilton
Hi Ed,

> How would people feel about the next version of MARC-Record (perhaps
> a v2.0) which handled utf8 properly and required a modern perl?

Entirely agree with Michael Doran: Definitely a *good* thing.  
 
> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep
> tallies and report back to the list.

- I am currently using MARC::Record 1.34 with Perl 5.6.0

- I'll soon be migrating to another machine running Aleph with utf8 data
with Perl 5.8.2
  I will install the latest stable version of MARC::Record on this machine.

Regards, Ian
_
Ian Hamilton 
Library Systems Administrator
European Commission - Directorate General for Education and Culture 
EAC C4 - Central Library Unit 
+32-2-295.24.60 (direct phone)
+32-2-299.91.89 (fax)
 http://europa.eu.int/comm/dgs/education_culture/index_en.htm
   http://europa.eu.int/comm/libraries/index.htm
 http://europa.eu.int/eclas/


RE: MARC::Record and UTF-8

2005-01-07 Thread Houghton,Andrew
From: Ron Davies [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 07, 2005 2:54 AM
Subject: Re: MARC::Record and UTF-8

> At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
>> Does anyone know of any work underway to adapt MARC::Record for utf-8 
>> encoding ?
>
> I will have a similar project in a few months' time, converting a whole bunch 
> of processing from MARC-8 to UTF-8. I would be very happy to assist in 
> testing or development of a UTF-8 capability for MARC::Record. Is the problem 
> listed in

This is not a Perl solution, but if you are just looking to convert 
MARC-8 records to UTF-8 records you can use Terry Reese's MarcEdit 
program.  Its MARC Tools section lets you run batch conversions.
You can download it from:

http://oregonstate.edu/~reeset/marcedit/html/downloads.html


Andy.





Re: MARC::Record and UTF-8

2005-01-07 Thread Ed Summers
On Fri, Jan 07, 2005 at 08:53:40AM +0100, Ron Davies wrote:
> I will have a similar project in a few months' time, converting a whole 
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist 
> in testing or development of a UTF-8 capability for MARC::Record. Is the 
> problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707) 
> the only known issue?

Correct. A few months ago I hacked at MARC::Record to try to get it to
use utf8 on platforms that have perl >= 5.8.

I backed out these changes because my initial implementation proved to
be faulty. Essentially I treated all data as utf8 if perl was >= 5.8
... but this didn't work out since some valid MARC-8 data is invalid
UTF-8. I was bummed.
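
To illustrate why valid MARC-8 can be invalid UTF-8: in MARC-8 (ANSEL) the
combining acute accent is the single byte 0xE2 and comes before its base
letter, whereas in UTF-8 the byte 0xE2 opens a three-byte sequence. Below is
a minimal sketch using the core Encode module. The helper name
is_valid_utf8 is made up for this example and is not part of MARC::Record.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if the octets decode cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode() may modify its argument when a CHECK is given
    return eval { decode( 'UTF-8', $copy, FB_CROAK ); 1 } ? 1 : 0;
}

# "cafe" with an acute on the final e, in two encodings.
my $marc8 = "caf\xE2e";       # MARC-8: combining acute (0xE2), then the base letter
my $utf8  = "caf\xC3\xA9";    # UTF-8: U+00E9 encoded as two octets

printf "MARC-8 bytes valid as UTF-8? %s\n", is_valid_utf8($marc8) ? 'yes' : 'no';
printf "UTF-8 bytes valid as UTF-8?  %s\n", is_valid_utf8($utf8)  ? 'yes' : 'no';
```

A check like this only catches data that cannot be UTF-8 at all. Plain
ASCII records pass either way, so it is not a way to detect the encoding.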

The problem (as Ron correctly points out) is that the Perl function length() 
is being used to construct the byte offsets in the record directory. This 
works fine when every character is one byte, but breaks badly on utf8 data, 
where length() counts characters and a character can take more than one byte.

Fortunately there is the bytes pragma, introduced in 5.6, which provides
a bytes::length() function that computes the correct length. I believe
bytes::length() itself was added later, somewhere in 5.8.
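
A small sketch of the difference, for anyone who wants to see it. The sample
string and the directory_entry helper are made up for illustration and are
not MARC::Record code. The entry layout (3-character tag, 4-digit length,
5-digit offset) is the standard MARC directory format. Note that
bytes::length() measures Perl's internal representation of the string, so the
sketch also shows the length of the explicitly encoded octets, which is what
actually gets written to the file.

```perl
use strict;
use warnings;
use Encode qw(encode);
use bytes ();    # makes bytes::length() available without enabling byte semantics

# Hypothetical field data: "Dvorak" with a caron on the r and an acute on the a,
# written with escapes so the example does not depend on the source file encoding.
my $field = "Dvo\x{159}\x{e1}k";

my $char_len  = length($field);                        # characters
my $byte_len  = bytes::length($field);                 # bytes of the internal form
my $octet_len = length( encode( 'UTF-8', $field ) );   # bytes as written out

print "characters:     $char_len\n";     # 6
print "bytes::length:  $byte_len\n";     # 8
print "encoded octets: $octet_len\n";    # 8

# Hypothetical helper: build one directory entry from the encoded octets.
sub directory_entry {
    my ( $tag, $data, $offset ) = @_;
    my $octets = encode( 'UTF-8', $data ) . "\x1E";    # field terminator is counted
    return sprintf '%03s%04d%05d', $tag, length($octets), $offset;
}

print directory_entry( '100', $field, 0 ), "\n";
```

Using length($field) here would record 7 instead of 9 for the field length,
and every later offset in the directory would be off by the accumulated
difference.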

I wanted MARC::Record to do the right thing based on position 9 in the
leader. But I don't know if this is feasible. Perhaps simply having a
flag when you create the MARC::Record, MARC::Batch or MARC::File::USMARC
objects will be enough.

my $batch = MARC::Batch->new( 'USMARC', 'file.dat', utf8 => 1 );

or

my $record = MARC::Record->new( utf8 => 1 );
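
For comparison, the leader-based approach could look something like the
sketch below. In MARC 21, leader position 9 (character coding scheme) is
blank for MARC-8 and 'a' for UCS/Unicode. The function name and the sample
leaders are made up for this example, and nothing here is existing
MARC::Record behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: true if leader position 9 declares UCS/Unicode.
sub leader_says_unicode {
    my ($leader) = @_;
    return 0 unless defined $leader && length($leader) >= 10;
    return substr( $leader, 9, 1 ) eq 'a' ? 1 : 0;
}

# Made-up 24-character leaders that differ only at position 9.
my $unicode_leader = '00714cam a2200205 a 4500';
my $marc8_leader   = '00714cam  2200205 a 4500';

for my $leader ( $unicode_leader, $marc8_leader ) {
    printf "%s => %s\n", $leader,
        leader_says_unicode($leader) ? 'UCS/Unicode' : 'MARC-8';
}
```

This trusts the leader to be set correctly, which is probably the reason an
explicit flag may still be needed as an override.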

Comments, thoughts, hacks welcome :-) This shouldn't be too tough, it
just needs some concentrated attention.

//Ed


RE: MARC::Record and UTF-8

2005-01-07 Thread Houghton,Andrew
> From: Ed Summers [mailto:[EMAIL PROTECTED] 
> Sent: 07 January, 2005 09:56
> To: perl4lib@perl.org
> Subject: Re: MARC::Record and UTF-8
> 
> On Fri, Jan 07, 2005 at 08:13:08AM -0500, Houghton,Andrew wrote:
> > This is not a Perl solution, but if you are just looking to convert
> > MARC-8 records to UTF-8 records you can use Terry Reese's MarcEdit 
> > program.
> 
> Does MarcEdit completely map MARC-8 to UTF-8?

Yes it does.  I think he uses the LC code table XML document for his
conversions.  The URL is:

http://www.loc.gov/marc/specifications/codetables.xml

which can be found off the Character Sets: Code Tables page at:

http://www.loc.gov/marc/specifications/specchartables.html


Andy.


Re: MARC::Record and UTF-8

2005-01-06 Thread Ron Davies
At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
> Does anyone know of any work underway to adapt MARC::Record for utf-8
> encoding ?

I will have a similar project in a few months' time, converting a whole 
bunch of processing from MARC-8 to UTF-8. I would be very happy to assist 
in testing or development of a UTF-8 capability for MARC::Record. Is the 
problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707) 
the only known issue?

Ron
Ron Davies
Information and documentation systems consultant
Av. Baden-Powell 1  Bte 2, 1200 Brussels, Belgium
Email:  ron(at)rondavies.be
Tel:+32 (0)2 770 33 51
GSM:+32 (0)484 502 393