RE: MARC::Charset 'utf8_to_marc8'

Doran, Michael D Tue, 18 Sep 2007 11:41:07 -0700

Hi Laurence,

> I'm trying to create MARC records from serials data exported 
> from SFX, using  MARC::Charset version 0.98 to convert UTF-8 
> strings to MARC-8. It seems to be failing on extended latin 
> characters like U+00C5 CAPITAL LETTER A WITH RING ABOVE


The code point U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) is a precomposed 
character [1].  While U+00C5 is perfectly good Unicode, I believe that the 
recommended practice for Unicode-encoded MARC-21 records is still to use base 
and combining characters, and U+00C5 doesn't have a direct equivalent in the 
MARC-21 repertoire [2,3].

If the strings are first normalized using Unicode Normalization Form D, they 
should convert okay [4,5].  
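
As a minimal sketch of what that normalization step does, the snippet below 
decomposes U+00C5 with Python's standard unicodedata module.  Python is used 
only for illustration, and the helper name decompose() is hypothetical.  In 
Perl, the core Unicode::Normalize module provides an NFD function, and its 
result would be what gets passed on to utf8_to_marc8.  Whether MARC::Charset 
0.98 then maps the result cleanly is the expectation stated above, not 
something this sketch verifies.

```python
import unicodedata


def decompose(text: str) -> str:
    """Return text in Unicode Normalization Form D (base letters + combining marks).

    Hypothetical helper for illustration; it only wraps the standard library call.
    """
    return unicodedata.normalize("NFD", text)


precomposed = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = decompose(precomposed)

# List the code points that make up the decomposed string.
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)  # ['U+0041', 'U+030A']
```

The one precomposed character becomes two: the base letter A followed by 
COMBINING RING ABOVE, which is the form the MARC-21 mapping expects.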

> The records convert using Terry Reese's MarcEdit OK.

Perhaps MarcEdit performs the decomposition itself, or maps precomposed 
Unicode characters directly to their decomposed MARC-8 equivalents.

-- Michael 

[1] The decomposition (i.e. base character followed by combining character) 
of "LATIN CAPITAL LETTER A WITH RING ABOVE" is U+0041 (LATIN CAPITAL LETTER A) 
followed by U+030A (COMBINING RING ABOVE).

[2] WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS FROM USMARC TO 
UNICODE/UCS

   * Accented letters ... will continue to be encoded as a base letter
     and non-spacing marks. Use of precomposed accented letters is not
     sanctioned at this stage.

    From "USMARC Character Set Issues and Mapping to Unicode/UCS"
    http://www.loc.gov/marc/marbi/1996/96-10.html 

[3] MARC 21 Specifications > CHARACTER SETS > Code Tables
    http://www.loc.gov/marc/specifications/specchartables.html

[4] Preprocessing Requirements

    ... preprocessing of the Unicode record before the conversion to
    MARC-8 takes place. In all of the above techniques, the following
    steps for decomposing diacritics were presumed.

    Decompose the precomposed base character/character modifier combinations
    using Unicode Normalization Form D (NFD) which produces exact equivalents,
    and primarily applies decomposition to precomposed characters with
    diacritics.

    From "Technique for conversion of Unicode to MARC-8"
    http://www.loc.gov/marc/marbi/2006/2006-04.html

[5] W3C > Charlint - A Character Normalization Tool
    http://www.w3.org/International/charlint/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Laurence Lockton [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, September 18, 2007 5:21 AM
> To: perl4lib@perl.org
> Subject: MARC::Charset 'utf8_to_marc8'
> 
> Hi,
> 
> I'm trying to create MARC records from serials data exported 
> from SFX, using  MARC::Charset version 0.98 to convert UTF-8 
> strings to MARC-8. It seems to be failing on extended latin 
> characters like U+00C5 CAPITAL LETTER A WITH RING ABOVE, 
> giving "no mapping found at position 176" for example. 
> The records convert using Terry Reese's MarcEdit OK. Am I 
> doing the wrong thing? Any advice gratefully received.
> 
> Many thanks,
> Laurence Lockton
> University of Bath
> UK
>
