RE: MARC::Record and UTF-8 & related threads
Sorry I didn't make it clear in my original posting that the record that I
modified, using MARC::Record, DID have Unicode encoding. I didn't just change
leader/09 to try to fool it into thinking it was Unicode; the record came out
of a database that had been converted to Unicode. And the 245 field had 3
characters with diacritics in it; those character+diacritic sequences did
consume 2 bytes each.

Anne L. Highsmith
Consortia Systems Coordinator
5000 TAMU
Evans Library
Texas A&M University
College Station, TX 77843-5000
[EMAIL PROTECTED]
979-862-4234
979-845-6238 (fax)

>>> "Doran, Michael D" <[EMAIL PROTECTED]> 03/07/05 09:06AM >>>
Hi Ed,

> How would people feel about the next version of MARC-Record (perhaps
> a v2.0) which handled utf8 properly and required a modern perl?

Definitely a *good* thing. Worth upgrading Perl version for, if necessary.

> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep
> tallies and report back to the list.

I have MARC::Record installed on two machines:

1) Perl 5.6.1 & MARC::Record 0.94
2) Perl 5.8.5 & MARC::Record 1.4

> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that
> > the directory length would not be computed correctly?
>
> Yes, but only if the record actually contains unicode :)

My understanding of Anne's posting was that the record she tested *did*
contain unicode: "I started with the Unicode version of the record and
modified it...".

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Ed Summers [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 07, 2005 8:37 AM
> To: perl4lib@perl.org
> Subject: Re: MARC::Record and UTF-8 & related threads
>
> On Fri, Mar 04, 2005 at 09:18:00AM -0500, Anne L. Highsmith wrote:
> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that the directory
> > length would not be computed correctly?
>
> Yes, but only if the record actually contains unicode :) If you are
> looking for an example of how MARC::Record breaks when there is utf8
> in the record you can look at t/utf8.t which is a test distributed with
> the MARC-Record package. Currently, this test is skipped because
> otherwise it would fail.
>
> > If anyone is inspired to make the necessary updates to the
> > MARC::Record module to handle unicode records, I'd certainly
> > be happy to test. I'd also be eternally grateful, since my
> > alternative might be re-writing 8 or 10 job streams in the
> > next 10 weeks so that I can: 1) export the records from my
> > database in MARC8; 2) edit them; 3) reload them using a
> > MARC8-Unicode conversion utility provided by the lms vendor.
>
> I've been meaning to write to the list about this for some time now.
> How would people feel about the next version of MARC-Record (perhaps a
> v2.0) which handled utf8 properly and required a modern perl? By modern
> perl I mean a version >= 5.8.1. The reason why 5.8.1 is required is that
> it's the first perl with a byte-oriented substr() (available via the
> bytes pragma).
>
> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep tallies
> and report back to the list.
>
> //Ed
Re: MARC::Record and UTF-8 & related threads
Thanks for the details about your Perl versions, Michael.

On Mon, Mar 07, 2005 at 09:06:48AM -0600, Doran, Michael D wrote:
> My understanding of Anne's posting was that the record she tested *did*
> contain unicode: "I started with the Unicode version of the record and
> modified it...".

Yeah, I didn't really understand what this specifically meant. I thought
perhaps Anne had simply set position 09 in the leader to indicate the record
contained utf8... which wouldn't adversely affect MARC::Record at all. The
main thing is that the record needs to contain a multibyte character. The
t/utf8.t test should make the problem evident.

//Ed
RE: MARC::Record and UTF-8 & related threads
Hi Ed,

> How would people feel about the next version of MARC-Record (perhaps
> a v2.0) which handled utf8 properly and required a modern perl?

Definitely a *good* thing. Worth upgrading Perl version for, if necessary.

> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep
> tallies and report back to the list.

I have MARC::Record installed on two machines:

1) Perl 5.6.1 & MARC::Record 0.94
2) Perl 5.8.5 & MARC::Record 1.4

> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that
> > the directory length would not be computed correctly?
>
> Yes, but only if the record actually contains unicode :)

My understanding of Anne's posting was that the record she tested *did*
contain unicode: "I started with the Unicode version of the record and
modified it...".

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Ed Summers [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 07, 2005 8:37 AM
> To: perl4lib@perl.org
> Subject: Re: MARC::Record and UTF-8 & related threads
>
> On Fri, Mar 04, 2005 at 09:18:00AM -0500, Anne L. Highsmith wrote:
> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that the directory
> > length would not be computed correctly?
>
> Yes, but only if the record actually contains unicode :) If you are
> looking for an example of how MARC::Record breaks when there is utf8
> in the record you can look at t/utf8.t which is a test distributed with
> the MARC-Record package. Currently, this test is skipped because
> otherwise it would fail.
>
> > If anyone is inspired to make the necessary updates to the
> > MARC::Record module to handle unicode records, I'd certainly
> > be happy to test. I'd also be eternally grateful, since my
> > alternative might be re-writing 8 or 10 job streams in the
> > next 10 weeks so that I can: 1) export the records from my
> > database in MARC8; 2) edit them; 3) reload them using a
> > MARC8-Unicode conversion utility provided by the lms vendor.
>
> I've been meaning to write to the list about this for some time now.
> How would people feel about the next version of MARC-Record (perhaps a
> v2.0) which handled utf8 properly and required a modern perl? By modern
> perl I mean a version >= 5.8.1. The reason why 5.8.1 is required is that
> it's the first perl with a byte-oriented substr() (available via the
> bytes pragma).
>
> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep tallies
> and report back to the list.
>
> //Ed
Re: MARC::Record and UTF-8 & related threads
On Fri, Mar 04, 2005 at 09:18:00AM -0500, Anne L. Highsmith wrote:
> Here's my main question -- is that the principal
> concern/question/problem, i.e. that directory lengths will not be
> computed correctly using the existing MARC::Record module with a
> Unicode record? Or is it only in certain situations that the directory
> length would not be computed correctly?

Yes, but only if the record actually contains unicode :) If you are looking
for an example of how MARC::Record breaks when there is utf8 in the record
you can look at t/utf8.t, a test distributed with the MARC-Record package.
Currently, this test is skipped because otherwise it would fail.

> If anyone is inspired to make the necessary updates to the MARC::Record
> module to handle unicode records, I'd certainly be happy to test. I'd also
> be eternally grateful, since my alternative might be re-writing 8 or 10 job
> streams in the next 10 weeks so that I can: 1) export the records from my
> database in MARC8; 2) edit them; 3) reload them using a MARC8-Unicode
> conversion utility provided by the lms vendor.

I've been meaning to write to the list about this for some time now. How
would people feel about the next version of MARC-Record (perhaps a v2.0)
which handled utf8 properly and required a modern perl? By modern perl I
mean a version >= 5.8.1. The reason why 5.8.1 is required is that it's the
first perl with a byte-oriented substr() (available via the bytes pragma).

Perhaps if people could respond to the list (or me if you prefer) with the
version of Perl that you use MARC::Record with, I could keep tallies and
report back to the list.

//Ed
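[A quick illustration of the breakage Ed describes -- not from the thread,
and the sample string is made up. On a decoded (character) string, Perl's
length() counts characters, while the MARC directory needs octet counts:]

```perl
use strict;
use warnings;
use Encode qw(encode);

# A decoded Perl character string: 6 characters, but the i-acute
# (U+00ED) occupies two octets once serialized as UTF-8.
my $title = "Mart\x{ed}n";    # "Martin" with an acute accent on the i

my $chars = length($title);                    # character count
my $bytes = length(encode('UTF-8', $title));   # octet count of the UTF-8 form

print "chars=$chars bytes=$bytes\n";           # prints "chars=6 bytes=7"
```

Directory offsets built from the first number will be short by one byte per
multibyte character, which is exactly the t/utf8.t failure mode.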
RE: MARC::Record and UTF-8
Do you know if your solution will work with older (5.8.0 or 5.6.1) Perl
versions? I'm limited (on my development/home machine) to 5.8.0a (MacPerl).

I might be able to test MARC::File::XML, but I'm not sure, since I am unable
to compile any modules relying upon C (the Mac Programmer's Workshop (MPW)
freeware crashes my machine under MacOS 9.2.2, and I haven't figured out yet
how to use it in MacOS 9.1 (lack of time) [1]). As a result, I can't use
expat-based XML solutions (though I haven't really explored the issue, so I
may be wrong). There is a pure Perl parser, I believe, so I'll see if that
works. Even so, I realize I may be out of luck if MacOS itself is unable to
handle Unicode (I thought 9.x implemented some Unicode handling).

[1] I've checked the MacPerl module porters' Web sites, but it doesn't
appear any binaries have been compiled since 5.6.1.

Thank you for your assistance,

Bryan Baldus
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://home.inwave.com/eija
RE: MARC::Record and UTF-8 (fwd)
> Does MarcEdit completely map MARC-8 to UTF-8?

Ed,

It maps all the codes found in the MARC-8 to Unicode XML mapping file -- at
least, that's the source file, and I've never had any complaints that
characters weren't being converted (save for a few characters that
occasionally get used that are outside of this mapping).

--Terry

PS. -- I get this on digest; Jackie just happened to forward me the message
early so I could respond.

***************************************
Terry Reese
Oregon State University Libraries
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University
Corvallis, OR 97331
Phone: 541-737-6384
Fax: 541-737-8267
[EMAIL PROTECTED]
http://oregonstate.edu/~reeset/
***************************************

> -----Original Message-----
> From: Jackie Shieh [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 07, 2005 8:18 AM
> To: Reese, Terry
> Subject: Re: MARC::Record and UTF-8 (fwd)
>
> Terry,
>
> Do you want to answer this?
>
> --Jackie
>
>
> ---------- Forwarded message ----------
> Date: Fri, 7 Jan 2005 08:56:12 -0600
> From: Ed Summers <[EMAIL PROTECTED]>
> To: perl4lib@perl.org
> Subject: Re: MARC::Record and UTF-8
>
> On Fri, Jan 07, 2005 at 08:13:08AM -0500, Houghton,Andrew wrote:
> > This is not a Perl solution, but if you are just looking to convert
> > MARC-8 records to UTF-8 records you can use Terry Reese's MarcEdit
> > program.
>
> Does MarcEdit completely map MARC-8 to UTF-8?
>
> //Ed
Re: MARC::Record and UTF-8
On Fri, 7 Jan 2005 09:59:31 -0600, Bryan Baldus <[EMAIL PROTECTED]> wrote:
> Do you know if your solution will work with older (5.8.0 or 5.6.1) Perl
> versions? I'm limited (on my development/home machine) to 5.8.0a (MacPerl).
> I might be able to test MARC::File::XML, but I'm not sure, since I am unable
> to compile any modules relying upon C (the Mac Programmer's Workshop (MPW)
> freeware crashes my machine under MacOS 9.2.2, and I haven't figured out yet
> how to use it in MacOS 9.1 (lack of time) [1]). As a result, I can't use
> expat-based XML solutions (though I haven't really explored the issue, so I
> may be wrong). There is a pure Perl parser, I believe, so I'll see if that
> works. Even so, though, I realize I may be out of luck if the MacOS itself
> is unable to handle Unicode (I thought 9.x implemented some Unicode
> handling).

I've not tested extensively with pre-5.6 Perls, but it should be fine. The
Encode module is standard in modern versions, but exists for 5.0 as well.
The use of Encode::encode() is really to get around issues *caused* by the
modern versions' unicode-ness. You can check out a longer explanation at
http://open-ils.org/blog/index.php?p=14 .

Any of the pure-perl XML parsers should work fine. Once the data is written
out as a stream of Unicode octets it will be read correctly. But do let me
know if you encounter any issues!

> [1] I've checked the MacPerl module porters' Web sites, but it doesn't
> appear any binaries have been compiled since 5.6.1.
>
> Thank you for your assistance,
>
> Bryan Baldus
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
> http://home.inwave.com/eija

-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org
Re: MARC::Record and UTF-8
On Fri, 07 Jan 2005 08:53:40 +0100, Ron Davies <[EMAIL PROTECTED]> wrote:
> At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
> > Does anyone know of any work underway to adapt MARC::Record for utf-8
> > encoding ?

I'm in the process of updating MARC::File::XML to support unicode. I was
hoping to have the changes in CVS about a month ago, but I've had no time
until now. Once that is done I'll look into what it will take to do the
same for MARC::File::USMARC. If you'd like to look into it you'll be able
to grab an updated MARC::File::XML from sourceforge's CVS some time this
afternoon. I'll announce it here when I get CVS updated, and post a link to
the anon cvs instructions from the project page.

> I will have a similar project in a few months' time, converting a whole
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist
> in testing or development of a UTF-8 capability for MARC::Record. Is the
> problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707)
> the only known issue?

The way I am getting around issues like this in MARC::File::XML is to strip
the utf8 flag off the data using Encode::encode(), which gives me the raw
bytes in the string. In that case length() works correctly, outputting to a
file does not complain about wide characters, and C-based XML libraries
(libxml2 in my case) see the correct data.

The only issue is that you cannot use any Unicode-aware perl functions on
the strings; everything is treated as 8-bit extended ASCII (or Latin-1, or
whatever non-Unicode codepage your locale is set up for). I can't find a
reason why this is actually a problem other than for locale-specific
sorting, which is not an issue for XML, as it is only used as an
input/output format; other software, usually written in C, handles actually
manipulating the data.

... Not that that applies *directly* to your question ... :)

-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org
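[A minimal sketch of the Encode::encode() trick Mike describes -- the
variable names and sample string are illustrative, not taken from
MARC::File::XML:]

```perl
use strict;
use warnings;
use Encode qw(encode);

my $field = "h\x{e9}ros";    # "heros" with e-acute: 5 characters
utf8::upgrade($field);       # force the internal utf8 flag, as decoded data would have

# encode() serializes the characters to UTF-8 octets and returns a plain
# byte string with the utf8 flag off.
my $octets = encode('UTF-8', $field);

# On the octet string, length() counts bytes, so MARC directory math works,
# and printing it raises no "wide character" warnings.
printf "chars=%d octets=%d\n", length($field), length($octets);
```

The tradeoff is exactly the one Mike notes: once flattened to octets, regex
character classes and sorting treat each byte, not each character, as a unit.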
RE: MARC::Record and UTF-8
> ...the ILS can be upgraded to a new version and
> people can start using Unicode, not only for Western
> European languages, but also for languages like Thai.

This is not really apropos to the discussion at hand, but since Thai was
mentioned I thought I would contribute my two cents on an issue that perhaps
not everyone is aware of...

Although the ILS itself will be able to accommodate the full Unicode
repertoire, according to the MARC 21 specifications, the MARC 21 UCS/Unicode
environment is simply the MARC-8 character repertoire translated into the
equivalent Unicode code points. One of the things that means is that
characters in vernacular alphabets such as Thai are *not* valid characters
in MARC 21 records. The rationale behind this approach to implementing
Unicode is based on the ability to translate MARC data back and forth (i.e.
"round trip") between the MARC-8 and Unicode character sets [1]. Supported
alphabets (and/or ideographs) are Latin, Greek, Cyrillic, Arabic, Hebrew,
and East Asian (CJK) [2].

I think our ILS is fairly typical as to implementation of Unicode [3]. There
is nothing stopping you from creating, storing, and displaying MARC records
in Thai (or any other vernacular language) -- other than an institutional
decision to adhere to the MARC 21 standard. Of course, the ILS software
clients also have validation rules that can be turned on (or off, since not
everyone uses MARC 21). At some point, when a large enough portion of the
library world has upgraded their systems to MARC Unicode, round tripping
will no longer be a constraint and the MARC 21 standard will be revised to
include the full range of Unicode characters, but that is liable to be a
while.

[1] Coded Character Sets > A Technical Primer for Librarians > MARC Unicode
    http://rocky.uta.edu/doran/charsets/unicode.html

[2] An exception is the Unified Canadian Aboriginal Syllabic character set,
    which is not defined in MARC-8 but is permitted in the MARC UCS/Unicode
    environment.

[3] Endeavor's Voyager -- and we are scheduled for the Unicode version
    upgrade on Monday

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
RE: MARC::Record and UTF-8
> From: Ed Summers [mailto:[EMAIL PROTECTED]
> Sent: 07 January, 2005 09:56
> To: perl4lib@perl.org
> Subject: Re: MARC::Record and UTF-8
>
> On Fri, Jan 07, 2005 at 08:13:08AM -0500, Houghton,Andrew wrote:
> > This is not a Perl solution, but if you are just looking to convert
> > MARC-8 records to UTF-8 records you can use Terry Reese's MarcEdit
> > program.
>
> Does MarcEdit completely map MARC-8 to UTF-8?

Yes it does. I think he uses the LC code table XML document for his
conversions. The URL is:

http://www.loc.gov/marc/specifications/codetables.xml

which can be found off the Character Sets: Code Tables page at:

http://www.loc.gov/marc/specifications/specchartables.html

Andy.
Re: MARC::Record and UTF-8
On Fri, Jan 07, 2005 at 08:13:08AM -0500, Houghton,Andrew wrote:
> This is not a Perl solution, but if you are just looking to convert
> MARC-8 records to UTF-8 records you can use Terry Reese's MarcEdit
> program.

Does MarcEdit completely map MARC-8 to UTF-8?

//Ed
Re: MARC::Record and UTF-8
On Fri, Jan 07, 2005 at 08:53:40AM +0100, Ron Davies wrote:
> I will have a similar project in a few months' time, converting a whole
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist
> in testing or development of a UTF-8 capability for MARC::Record. Is the
> problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707)
> the only known issue?

Correct. A few months ago I hacked at MARC::Record to try to get it to use
utf8 for platforms that support perl >= 5.8. I backed out these changes
because my initial implementation proved to be faulty. Essentially I treated
all data as utf8 if perl was >= 5.8 ... but this didn't work out, since some
valid MARC-8 data is invalid UTF-8. I was bummed.

The problem (as Ron correctly points out) is that the Perl function length()
is being used to construct the byte offsets in the record directory. This
works fine when a character is a byte, but breaks badly on utf8 data, where
a character can be more than one byte. Fortunately there is the bytes
pragma, introduced in 5.6, which has a bytes::length() function that
computes the correct length. I believe bytes::length() itself was added
later, in 5.8 somewhere.

I wanted MARC::Record to do the right thing based on position 9 in the
leader, but I don't know if this is feasible. Perhaps simply having a flag
when you create the MARC::Record, MARC::Batch or MARC::File::USMARC objects
will be enough:

    my $batch = MARC::Batch->new( 'USMARC', 'file.dat', utf8 => 1 );

or

    my $record = MARC::Record->new( utf8 => 1 );

Comments, thoughts, hacks welcome :-) This shouldn't be too tough, it just
needs some concentrated attention.

//Ed
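[A small demonstration of the character/octet split behind Ed's point -- the
string is a made-up example; the proposed utf8 => 1 flag above is Ed's
sketch, not an existing API:]

```perl
use strict;
use warnings;
use bytes ();    # load without importing, so plain length() stays character-based

# U+263A (a smiley) is a single character but three octets in UTF-8;
# a codepoint above 255 also forces Perl's internal UTF-8 representation.
my $s = "sm\x{263A}le";

print length($s), "\n";           # 5 -- characters: wrong for directory offsets
print bytes::length($s), "\n";    # 7 -- octets: what the MARC directory needs
```

A MARC::Record patched along these lines would swap length() for
bytes::length() (or equivalent) everywhere directory offsets and the record
length in leader positions 0-4 are computed.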
RE: MARC::Record and UTF-8
At 14:13 7/01/2005, Houghton,Andrew wrote:
> This is not a Perl solution, but if you are just looking to convert
> MARC-8 records to UTF-8 records you can use Terry Reese's MarcEdit
> program. Under its MARC Tools section it allows you to do batch
> conversions.

Thanks, Andy. It has been a while since I looked at MarcEdit, and it's
probably worth another look.

But, no, it's not just a question of converting records. That will be done
by my client's ILS vendor. The problem really lies in converting a bunch of
regular processing that takes place in batch on new and revised bib records,
to do things like add equivalent thesaurus descriptors in other languages,
add special search terms, edit local fields, and do some very
database-specific checking. All this is implemented in Perl using MARC-8
records, and the processing has to be updated to accommodate UTF-8 before
the ILS can be upgraded to a new version and people can start using Unicode,
not only for Western European languages, but also for languages like Thai.

Ron
RE: MARC::Record and UTF-8
> From: Ron Davies [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 07, 2005 2:54 AM
> Subject: Re: MARC::Record and UTF-8
>
> At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
> > Does anyone know of any work underway to adapt MARC::Record for utf-8
> > encoding ?
>
> I will have a similar project in a few months' time, converting a whole
> bunch of processing from MARC-8 to UTF-8. I would be very happy to assist
> in testing or development of a UTF-8 capability for MARC::Record. Is the
> problem listed in

This is not a Perl solution, but if you are just looking to convert MARC-8
records to UTF-8 records you can use Terry Reese's MarcEdit program. Under
its MARC Tools section it allows you to do batch conversions. You can
download it from:

<http://oregonstate.edu/~reeset/marcedit/html/downloads.html>

Andy.
Re: MARC::Record and UTF-8
At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
> Does anyone know of any work underway to adapt MARC::Record for utf-8
> encoding ?

I will have a similar project in a few months' time, converting a whole
bunch of processing from MARC-8 to UTF-8. I would be very happy to assist
in testing or development of a UTF-8 capability for MARC::Record. Is the
problem listed in rt.cpan.org (http://rt.cpan.org/NoAuth/Bug.html?id=3707)
the only known issue?

Ron


Ron Davies
Information and documentation systems consultant
Av. Baden-Powell 1 Bte 2, 1200 Brussels, Belgium
Email: ron(at)rondavies.be
Tel:   +32 (0)2 770 33 51
GSM:   +32 (0)484 502 393