RE: MARC::Record and UTF-8 & related threads

2005-03-07 Thread Anne Highsmith
Sorry I didn't make it clear in my original posting that the record that
I modified, using MARC::Record, DID have Unicode encoding. I didn't just
change the leader/09 to try to fool it into thinking it was Unicode; the
record came out of a database that had been converted to Unicode. And
the 245 field had 3 characters with diacritics in it; those
character+diacritic sequences did consume 2 bytes each.

Anne L. Highsmith
Consortia Systems Coordinator
5000 TAMU
Evans Library
Texas A&M University
College Station, TX   77843-5000
[EMAIL PROTECTED]
979-862-4234
979-845-6238 (fax)

>>> "Doran, Michael D" <[EMAIL PROTECTED]> 03/07/05 09:06AM >>>
Hi Ed,

> How would people feel about the next version of MARC-Record (perhaps
> a v2.0) which handled utf8 properly and required a modern perl? 

Definitely a *good* thing.  Worth upgrading Perl version for, if
necessary.
 
> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep
> tallies and report back to the list.

I have MARC::Record installed on two machines:
1) Perl 5.6.1 & MARC::Record 0.94
2) Perl 5.8.5 & MARC::Record 1.4

> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that 
> > the directory length would not be computed correctly?
> 
> Yes, but only if the record actually contains unicode :)

My understanding of Anne's posting was that the record she tested *did*
contain unicode: "I started with the Unicode version of the record and
modified it...".

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED] 
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Ed Summers [mailto:[EMAIL PROTECTED] 
> Sent: Monday, March 07, 2005 8:37 AM
> To: perl4lib@perl.org 
> Subject: Re: MARC::Record and UTF-8 & related threads
> 
> On Fri, Mar 04, 2005 at 09:18:00AM -0500, Anne L. Highsmith wrote:
> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that the directory
> > length would not be computed correctly?
> 
> Yes, but only if the record actually contains unicode :) If you are
> looking for an example of how MARC::Record breaks when there is utf8
> in the record you can look at t/utf8.t which is a test distributed with
> the MARC-Record package. Currently, this test is skipped because otherwise
> it would fail.
> 
> > If anyone is inspired to make the necessary updates to the MARC::Record
> > module to handle unicode records, I'd certainly be happy to test. I'd also be
> > eternally grateful, since my alternative might be re-writing 8 or 10 job
> > streams in the next 10 weeks so that I can: 1) export the records from my
> > database in MARC8; 2) edit them; 3) reload them doing a MARC8-Unicode
> > conversion utility provided by the lms vendor.
> 
> I've been meaning to write to the list about this for sometime now. How
> would people feel about the next version of MARC-Record (perhaps a
> v2.0) which handled utf8 properly and required a modern perl? By modern
> perl I mean a version >= 5.8.1. The reason why 5.8.1 is required is that
> it's the first perl with a byte oriented substr() (available via the
> bytes pragma).
> 
> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep tallies
> and report back to the list.
> 
> //Ed
> 


Re: MARC::Record and UTF-8 & related threads

2005-03-07 Thread Ed Summers
Thanks for the details about your Perl versions Michael.

On Mon, Mar 07, 2005 at 09:06:48AM -0600, Doran, Michael D wrote:
> My understanding of Anne's posting was that the record she tested *did*
> contain unicode: "I started with the Unicode version of the record and
> modified it...".

Yeah, I didn't really understand what this specifically meant. I thought
perhaps Anne had simply set position 09 in the leader to indicate the
record contained utf8...which wouldn't adversely affect MARC::Record at
all.

The main thing is that the record should contain a multibyte character.
The t/utf8.t should make the problem evident.

//Ed


RE: MARC::Record and UTF-8 & related threads

2005-03-07 Thread Doran, Michael D
Hi Ed,

> How would people feel about the next version of MARC-Record (perhaps
> a v2.0) which handled utf8 properly and required a modern perl? 

Definitely a *good* thing.  Worth upgrading Perl version for, if
necessary.
 
> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep
> tallies and report back to the list.

I have MARC::Record installed on two machines:
1) Perl 5.6.1 & MARC::Record 0.94
2) Perl 5.8.5 & MARC::Record 1.4

> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that 
> > the directory length would not be computed correctly?
> 
> Yes, but only if the record actually contains unicode :)

My understanding of Anne's posting was that the record she tested *did*
contain unicode: "I started with the Unicode version of the record and
modified it...".

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Ed Summers [mailto:[EMAIL PROTECTED] 
> Sent: Monday, March 07, 2005 8:37 AM
> To: perl4lib@perl.org
> Subject: Re: MARC::Record and UTF-8 & related threads
> 
> On Fri, Mar 04, 2005 at 09:18:00AM -0500, Anne L. Highsmith wrote:
> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that the directory
> > length would not be computed correctly?
> 
> Yes, but only if the record actually contains unicode :) If you are
> looking for an example of how MARC::Record breaks when there is utf8
> in the record you can look at t/utf8.t which is a test distributed with
> the MARC-Record package. Currently, this test is skipped because otherwise
> it would fail.
> 
> > If anyone is inspired to make the necessary updates to the MARC::Record
> > module to handle unicode records, I'd certainly be happy to test. I'd also be
> > eternally grateful, since my alternative might be re-writing 8 or 10 job
> > streams in the next 10 weeks so that I can: 1) export the records from my
> > database in MARC8; 2) edit them; 3) reload them doing a MARC8-Unicode
> > conversion utility provided by the lms vendor.
> 
> I've been meaning to write to the list about this for sometime now. How
> would people feel about the next version of MARC-Record (perhaps a
> v2.0) which handled utf8 properly and required a modern perl? By modern
> perl I mean a version >= 5.8.1. The reason why 5.8.1 is required is that
> it's the first perl with a byte oriented substr() (available via the
> bytes pragma).
> 
> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep tallies
> and report back to the list.
> 
> //Ed
> 


Re: MARC::Record and UTF-8 & related threads

2005-03-07 Thread Ed Summers
On Fri, Mar 04, 2005 at 09:18:00AM -0500, Anne L. Highsmith wrote:
> Here's my main question -- is that the principal
> concern/question/problem, i.e. that directory lengths will not be
> computed correctly using the existing MARC::Record module with a
> Unicode record? Or is it only in certain situations that the directory
> length would not be computed correctly?

Yes, but only if the record actually contains unicode :) If you are
looking for an example of how MARC::Record breaks when there is utf8 
in the record you can look at t/utf8.t which is a test distributed with
the MARC-Record package. Currently, this test is skipped because otherwise 
it would fail.
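
To illustrate the directory-length question above, here is a minimal sketch of how a character count and a byte count diverge for one field. It is not MARC::Record's actual code: directory_entry() is a hypothetical helper, and the length ignores indicators, subfield delimiters and the field terminator that a real directory entry would include.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: format one 12-character directory entry
# (3-character tag, 4-digit field length, 5-digit starting offset).
sub directory_entry {
    my ($tag, $length, $offset) = @_;
    return sprintf '%s%04d%05d', $tag, $length, $offset;
}

# Field data holding "e" followed by U+0301 COMBINING ACUTE ACCENT,
# decoded from UTF-8 octets into a Perl character string.
my $field = decode('UTF-8', "Cafe\xCC\x81");

my $char_len = length $field;                    # counts characters
my $byte_len = length encode('UTF-8', $field);   # counts octets as written to the file

my $wrong = directory_entry('245', $char_len, 0);
my $right = directory_entry('245', $byte_len, 0);

print "by characters: $wrong\n";
print "by bytes:      $right\n";
```

Every later offset in the directory inherits the one-byte shortfall, which is why a single multibyte character is enough to make the record unreadable.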

> If anyone is inspired to make the necessary updates to the MARC::Record 
> module to handle unicode records, I'd certainly be happy to test. I'd also be 
> eternally grateful, since my alternative might be re-writing 8 or 10 job 
> streams in the next 10 weeks so that I can: 1) export the records from my 
> database in MARC8; 2) edit them; 3) reload them doing a MARC8-Unicode 
> conversion utility provided by the lms vendor.

I've been meaning to write to the list about this for sometime now. How
would people feel about the next version of MARC-Record (perhaps a
v2.0) which handled utf8 properly and required a modern perl? By modern
perl I mean a version >= 5.8.1. The reason why 5.8.1 is required is that
it's the first perl with a byte oriented substr() (available via the
bytes pragma).

Perhaps if people could respond to the list (or me if you prefer) with
the version of Perl that you use MARC::Record with I could keep tallies
and report back to the list.

//Ed


RE: MARC::Record and UTF-8

2005-01-07 Thread Bryan Baldus
Do you know if your solution will work with older (5.8.0 or 5.6.1) Perl
versions? I'm limited (on my development/home machine) to 5.8.0a (MacPerl).
I might be able to test MARC::File::XML, but I'm not sure, since I am unable
to compile any modules relying upon C (the Mac Programmer's Workshop (MPW)
freeware crashes my machine under MacOS 9.2.2, and I haven't figured out yet
how to use it in MacOS 9.1 (lack of time) [1]). As a result, I can't use
expat-based XML solutions (though I haven't really explored the issue, so I
may be wrong). There is a pure Perl parser, I believe, so I'll see if that
works. Even so, though, I realize I may be out of luck if the MacOS itself
is unable to handle Unicode (I thought 9.x implemented some Unicode
handling).

[1] I've checked the MacPerl module porters Web sites, but it doesn't appear
any binaries have been compiled since 5.6.1.

Thank you for your assistance,

Bryan Baldus
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://home.inwave.com/eija


RE: MARC::Record and UTF-8 (fwd)

2005-01-07 Thread Reese, Terry
>Does MarcEdit completely map MARC-8 to UTF-8?
Ed, 

It maps all the codes found in the MARC-8 to Unicode XML mapping file --
at least, that's the source file and I've never had any complaints that
characters weren't being converted (save for a few characters that
occasionally get used that are outside of this mapping).

--Terry

PS. -- I get this on digest, Jackie just happened to forward me the
message early so I could respond.


***
Terry Reese   
Oregon State University Libraries
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University
Corvallis, Or  97331 
Phone: 541-737-6384
Fax: 541-737-8267
[EMAIL PROTECTED]
http://oregonstate.edu/~reeset/

**


>-Original Message-
>From: Jackie Shieh [mailto:[EMAIL PROTECTED] 
>Sent: Friday, January 07, 2005 8:18 AM
>To: Reese, Terry
>Subject: Re: MARC::Record and UTF-8 (fwd)
>
>
>Terry,
>
>Do you want to answer this?
>
>--Jackie
>
>
>-- Forwarded message --
>Date: Fri, 7 Jan 2005 08:56:12 -0600
>From: Ed Summers <[EMAIL PROTECTED]>
>To: perl4lib@perl.org
>Subject: Re: MARC::Record and UTF-8
>
>On Fri, Jan 07, 2005 at 08:13:08AM -0500, Houghton,Andrew wrote:
>> This is not a Perl solution, but if you are just looking to convert
>> MARC-8 records to UTF-8 record you can use Terry Reese's MarcEdit 
>> program.
>
>Does MarcEdit completely map MARC-8 to UTF-8?
>
>//Ed
>
>
>
>


Re: MARC::Record and UTF-8

2005-01-07 Thread Mike Rylander
On Fri, 7 Jan 2005 09:59:31 -0600, Bryan Baldus
<[EMAIL PROTECTED]> wrote:
> Do you know if your solution will work with older (5.8.0 or 5.6.1) Perl
> versions? I'm limited (on my development/home machine) to 5.8.0a (MacPerl).
> I might be able to test MARC::File::XML, but I'm not sure, since I am unable
> to compile any modules relying upon C (the Mac Programmer's Workshop (MPW)
> freeware crashes my machine under MacOS 9.2.2, and I haven't figured out yet
> how to use it in MacOS 9.1 (lack of time) [1]). As a result, I can't use
> expat-based XML solutions (though I haven't really explored the issue, so I
> may be wrong). There is a pure Perl parser, I believe, so I'll see if that
> works. Even so, though, I realize I may be out of luck if the MacOS itself
> is unable to handle Unicode (I thought 9.x implemented some Unicode
> handling).

I've not tested extensively with pre-5.6 Perls, but it should be fine.
The Encode module is standard in modern versions, but exists for 5.0
as well.  The use of Encode::encode() is really to get around issues
*caused* by the modern versions' unicode-ness.  You can check out a
longer explanation at http://open-ils.org/blog/index.php?p=14 .

Any of the pure-perl XML parsers should work fine.  Once the data is
written out as a stream of Unicode octets it will be read correctly. 
But do let me know if you encounter any issues!

> 
> [1] I've checked the MacPerl module porters Web sites, but it doesn't appear
> any binaries have been compiled since 5.6.1.
> 
> Thank you for your assistance,
> 
> Bryan Baldus
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
> http://home.inwave.com/eija
> 


-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org


Re: MARC::Record and UTF-8

2005-01-07 Thread Mike Rylander
On Fri, 07 Jan 2005 08:53:40 +0100, Ron Davies <[EMAIL PROTECTED]> wrote:
> At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
> >Does anyone know of any work underway to adapt MARC::Record for utf-8
> >encoding ?

I'm in the process of updating MARC::File::XML to support unicode.  I
was hoping to have the changes in CVS about a month ago, but I've had
no time until now.

Once that is done I'll look into what it will take to do the same for
MARC::File::USMARC.  If you'd like to look into it you'll be able to
grab an updated MARC::File::XML from sourceforge's CVS some time this
afternoon.  I'll announce it here when I get CVS updated, and post a
link to the anon cvs instructions from the project page.

> 
> I will have a similar project in a few months' time, converting a whole
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist
> in testing or development of a UTF-8 capability for MARC::Record. Is the
> problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707)
> the only known issue?

The way I am getting around issues like this in MARC::File::XML is to
strip the utf8 flag off the data using Encode::encode(), which gives
me the raw bytes in the string.  In that case length works correctly,
outputting to a file does not complain about wide characters, and
C-based XML libraries (libxml2 in my case) see the correct data.  The
only issue is that you cannot use any Unicode-aware perl functions
on the strings; everything is treated as 8-bit Extended ASCII (or
Latin-1, or whatever non-Unicode codepage your locale is set up for). 
I can't find a reason why this is actually a problem other than for
locale specific sorting, which is not an issue for XML as it is only
used as an input/output format; other software, usually written in C,
handles actually manipulating the data.
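
A minimal sketch of the Encode::encode() approach described above, using only the core Encode module. The sample string is an assumption chosen for illustration, not data from MARC::File::XML.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# A character string: "Caf" + e-acute + space + three Thai letters.
my $chars = "Caf\x{E9} \x{0E44}\x{0E17}\x{0E22}";

# encode() returns the UTF-8 octets with the utf8 flag off.
my $octets = encode('UTF-8', $chars);

printf "characters: %d\n", length $chars;
printf "octets:     %d\n", length $octets;
printf "utf8 flag on octets: %s\n", is_utf8($octets) ? 'on' : 'off';
```

Because $octets is a plain byte string, length() on it gives the on-disk size and printing it to a filehandle does not trigger a "Wide character" warning.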

...  Not that that applies *directly* to your question ... :)

-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org


RE: MARC::Record and UTF-8

2005-01-07 Thread Doran, Michael D
> ...the ILS can be upgraded to a new version and
> people can start using Unicode, not only for Western
> European languages, but also for languages like Thai.

This is not really apropos to the discussion at hand, but since Thai was
mentioned I thought I would contribute my two cents on an issue that
perhaps not everyone is aware of...  

Although the ILS itself will be able to accommodate the full Unicode
repertoire, according to the MARC 21 specifications, the MARC 21
UCS/Unicode environment is simply the MARC-8 character repertoire
translated into the Unicode equivalent code points.  One of the things
that means is that characters in vernacular alphabets such as Thai are
*not* valid characters in MARC 21 records.  The rationale behind this
approach to implementing Unicode is based on the ability to translate
MARC data back and forth (i.e. "round trip") between the MARC-8 and
Unicode character sets [1].  Supported alphabets (and/or ideographs) are
Latin, Greek, Cyrillic, Arabic, Hebrew, and East Asian (CJK) [2].

I think our ILS is fairly typical as to implementation of Unicode [3].
There is nothing stopping you from creating, storing, and displaying
MARC records in Thai (or any other vernacular language) -- other than an
institutional decision to adhere to the MARC 21 standard.  Of course,
the ILS software clients also have validation rules that can be turned
on (or off, since not everyone uses MARC 21).

At some point, when a large enough portion of the library world has
upgraded their systems to MARC Unicode, round tripping will no longer be
a constraint and the MARC 21 standard will be revised to include the
full range of Unicode characters, but that is liable to be a while.

[1] Coded Character Sets > A Technical Primer for Librarians > MARC
Unicode 
http://rocky.uta.edu/doran/charsets/unicode.html

[2] An exception is the Unified Canadian Aboriginal Syllabic character
set, which is not defined in MARC-8 but is permitted in the MARC
UCS/Unicode environment. 

[3] Endeavor's Voyager - and we are scheduled for the Unicode version
upgrade on Monday

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 


RE: MARC::Record and UTF-8

2005-01-07 Thread Houghton,Andrew
> From: Ed Summers [mailto:[EMAIL PROTECTED] 
> Sent: 07 January, 2005 09:56
> To: perl4lib@perl.org
> Subject: Re: MARC::Record and UTF-8
> 
> On Fri, Jan 07, 2005 at 08:13:08AM -0500, Houghton,Andrew wrote:
> > This is not a Perl solution, but if you are just looking to convert
> > MARC-8 records to UTF-8 record you can use Terry Reese's MarcEdit 
> > program.
> 
> Does MarcEdit completely map MARC-8 to UTF-8?

Yes it does.  I think he uses the LC code table XML document for his
conversions.  The URL is:

http://www.loc.gov/marc/specifications/codetables.xml

which can be found off the Character Sets: Code Tables page at:

http://www.loc.gov/marc/specifications/specchartables.html


Andy.


Re: MARC::Record and UTF-8

2005-01-07 Thread Ed Summers
On Fri, Jan 07, 2005 at 08:13:08AM -0500, Houghton,Andrew wrote:
> This is not a Perl solution, but if you are just looking to convert 
> MARC-8 records to UTF-8 record you can use Terry Reese's MarcEdit 
> program.  

Does MarcEdit completely map MARC-8 to UTF-8?

//Ed


Re: MARC::Record and UTF-8

2005-01-07 Thread Ed Summers
On Fri, Jan 07, 2005 at 08:53:40AM +0100, Ron Davies wrote:
> I will have a similar project in a few months' time, converting a whole 
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist 
> in testing or development of a UTF-8 capability for MARC::Record. Is the 
> problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707) 
> the only known issue?

Correct. A few months ago I hacked at MARC::Record to try to get it to
use utf8 for platforms that support perl >= 5.8.

I backed out these changes because my initial implementation proved to
be faulty. Essentially I treated all data as utf8 if perl was >= 5.8
... but this didn't work out since some valid MARC-8 data is invalid
UTF-8. I was bummed. 
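
As an illustration of why valid MARC-8 can be invalid UTF-8, the sketch below checks two byte strings with the core Encode module. is_valid_utf8() is a hypothetical helper, and the sample data assumes the MARC-8 (ANSEL) convention that the combining acute, byte 0xE2, precedes its base letter.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true when the octets decode cleanly as UTF-8.
# decode() with FB_CROAK dies on malformed input; it works on a copy here.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK); 1 } ? 1 : 0;
}

my $marc8 = "Caf\xE2e";       # MARC-8: combining acute (0xE2) before "e"
my $utf8  = "Cafe\xCC\x81";   # UTF-8: "e" followed by U+0301

printf "MARC-8 bytes valid as UTF-8: %d\n", is_valid_utf8($marc8);
printf "UTF-8 bytes valid as UTF-8:  %d\n", is_valid_utf8($utf8);
```

In UTF-8 the byte 0xE2 opens a three-byte sequence, so an ASCII letter right after it makes the data malformed.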

The problem (as Ron correctly points out) is that the Perl function length() 
is being used to construct the byte offsets in the record directory. This 
works fine when a character is a byte, but breaks badly on utf8 data since a 
character is more than one byte.

Fortunately there is the bytes pragma which was introduced in 5.6 which
has a bytes::length() function which computes the correct length. I
believe that bytes::length() was introduced in 5.8 somewhere; it was
added on later.
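
A small sketch of the difference between length() and bytes::length(), assuming a Perl that ships the bytes pragma. Note that bytes::length() reports the size of Perl's internal representation, so it matches the UTF-8 size only for strings Perl already stores as UTF-8, as in this example.

```perl
use strict;
use warnings;
use bytes ();   # load the module without switching this scope to byte semantics

# "Dvorak" with r-caron (U+0159) and a-acute (U+00E1), as a character string.
my $title = "Dvo\x{159}\x{E1}k";

my $chars = length $title;           # character count
my $bytes = bytes::length($title);   # byte count of the internal UTF-8 form

print "characters: $chars\n";
print "bytes:      $bytes\n";
```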

I wanted MARC::Record to do the right thing based on position 9 in the
leader. But I don't know if this is feasible. Perhaps simply having a
flag when you create the MARC::Record, MARC::Batch or MARC::File::USMARC
objects will be enough.

my $batch = MARC::Batch->new( 'USMARC', 'file.dat', utf8 => 1 );

or

my $record = MARC::Record->new( utf8 => 1 );

Comments, thoughts, hacks welcome :-) This shouldn't be too tough, it
just needs some concentrated attention.
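
For the leader-based alternative mentioned above, a check could look roughly like this. leader_says_unicode() is a hypothetical helper, not an existing MARC::Record method, and the sample leaders are made up; the sketch relies only on the MARC 21 convention that leader position 09 is 'a' for UCS/Unicode and blank for MARC-8.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a 24-character leader declares
# UCS/Unicode in position 09 (zero-based offset 9).
sub leader_says_unicode {
    my ($leader) = @_;
    return substr($leader, 9, 1) eq 'a' ? 1 : 0;
}

my $marc8_leader   = '00714cam  2200205 a 4500';   # position 09 is blank
my $unicode_leader = '00714cam a2200205 a 4500';   # position 09 is 'a'

printf "MARC-8 leader:  %d\n", leader_says_unicode($marc8_leader);
printf "Unicode leader: %d\n", leader_says_unicode($unicode_leader);
```

A check like this trusts the leader, so it would misreport a record whose leader/09 was changed without converting the data.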

//Ed


RE: MARC::Record and UTF-8

2005-01-07 Thread Ron Davies
At 14:13 7/01/2005, Houghton,Andrew wrote:
> This is not a Perl solution, but if you are just looking to convert
> MARC-8 records to UTF-8 record you can use Terry Reese's MarcEdit
> program.  Under its MARC Tools section it allows you to do batch
> conversions.

Thanks, Andy. It has been a while since I looked at MarcEdit, and it's 
probably worth another look.

But, no, it's not just a question of converting records. That will be done 
by my client's ILS vendor. The problem really lies in converting a bunch of 
regular processing that takes place in batch on new and revised bib 
records, to do things like add equivalent thesaurus descriptors in other 
languages, add special search terms, edit local fields and do some very 
database-specific checking. All this is implemented in Perl using MARC-8 
records, and the processing has to be updated to accommodate UTF-8 before 
the ILS can be upgraded to a new version and people can start using 
Unicode, not only for Western European languages, but also for languages 
like Thai.

Ron



RE: MARC::Record and UTF-8

2005-01-07 Thread Houghton,Andrew
>From: Ron Davies [mailto:[EMAIL PROTECTED] 
>Sent: Friday, January 07, 2005 2:54 AM
>Subject: Re: MARC::Record and UTF-8
>
>At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
>>Does anyone know of any work underway to adapt MARC::Record for utf-8 
>>encoding ?
>
>I will have a similar project in a few months' time, converting a whole bunch 
>of processing from MARC-8 to UTF-8. I would be very happy to assist in 
>testing or development of a UTF-8 capability for MARC::Record. Is the problem 
>listed in

This is not a Perl solution, but if you are just looking to convert 
MARC-8 records to UTF-8 record you can use Terry Reese's MarcEdit 
program.  Under its MARC Tools section it allows you to do batch 
conversions.  You can download it from:

<http://oregonstate.edu/~reeset/marcedit/html/downloads.html>


Andy.





Re: MARC::Record and UTF-8

2005-01-06 Thread Ron Davies
At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
> Does anyone know of any work underway to adapt MARC::Record for utf-8
> encoding ?

I will have a similar project in a few months' time, converting a whole 
bunch of processing from MARC-8 to UTF-8. I would be very happy to assist 
in testing or development of a UTF-8 capability for MARC::Record. Is the 
problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707) 
the only known issue?

Ron
Ron Davies
Information and documentation systems consultant
Av. Baden-Powell 1  Bte 2, 1200 Brussels, Belgium
Email:  ron(at)rondavies.be
Tel:+32 (0)2 770 33 51
GSM:+32 (0)484 502 393