Hi Henri,

> Is there a reason why MARC::File::XML considers only a very
> strict subset of utf-8 as valid ?

I would guess that it has to do with adhering to the MARC-21 repertoire of characters, so as to facilitate round-trip conversion between the MARC-8 and Unicode character sets [1,2]. At some point in the future the MARC-21 repertoire will be decoupled from what was defined for MARC-8.

> For instance no linebreak...

Control characters such as line breaks are a bit of a different issue. The MARC-21 standard currently allows for only a handful of control characters, not including (as you have discovered) the line break [3].

> This could be a really BIG trouble for kanjis or hindu languages imho.

The MARC-21 repertoire of characters includes East Asian ideographs (Han), Japanese Hiragana and Katakana, and Korean Hangul [4,5]. I don't believe that Indic scripts in the vernacular would be valid MARC-21 characters.

Are you finding any cases where the MARC::File::XML parser is dropping valid MARC-21 characters?

-- Michael

[1] USMARC Character Set Issues and Mapping to Unicode/UCS
    http://www.loc.gov/marc/marbi/1996/96-10.html

    WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS
    FROM USMARC TO UNICODE/UCS

    The following Working Principles were established by the
    Subcommittee and continue to inform their mapping decisions:

    * Round-trip mapping will be provided between USMARC characters
      and Unicode/UCS characters wherever possible.

[2] MARC 21 Specifications > CHARACTER SETS: Part 2 UCS/Unicode Environment
    http://www.loc.gov/marc/specifications/speccharucs.html

    "The specifications are built around enabling round trip movement
    of MARC data between MARC-8 and UCS/Unicode with as little loss
    as possible."

[3] MARC-8   Unicode   Character
    ------   -------   ---------
    0x1B     U+001B    ESCAPE
    0x1D     U+001D    RECORD TERMINATOR
    0x1E     U+001E    FIELD TERMINATOR
    0x1F     U+001F    SUBFIELD DELIMITER
    0x88     U+0098    NON-SORT BEGIN
    0x89     U+009C    NON-SORT END
    0x8D     U+200D    JOINER
    0x8E     U+200C    NON-JOINER

[4] MARC 21 Specifications > CHARACTER SETS: Part 3 Code Tables
    http://www.loc.gov/marc/specifications/specchartables.html

[5] MARC 21 Standard - UCS/Unicode Environment > Character Set Mappings
    http://rocky.uta.edu/doran/charsets/marcU.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Henri-Damien LAURENT [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 26, 2007 10:45 AM
> To: perl4lib
> Subject: MARC::File::XML and parsing.
>
> hi,
> I have some problems with Marc::File::XML parser.
>
> Take those two xml records.
> Despite the fact that I agree that there are odd characters
> in some subfields.
> I am wondering why, since those characters are UTF8,
> MARC::File::XML should drop them when parsing.
> Is there a reason why MARC::File::XML considers only a very
> strict subset of utf-8 as valid ? (For instance no linebreak,
> no ...) ?
>
> Couldnot it say "OK It is XML record, encoded UTF8, i take
> it for granted and no matter if there are "odd" characters" ?
> This could be a really BIG trouble for kanjis or hindu languages imho.
>
>
> <?xml version="1.0" encoding="UTF-8"?>
> <record
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.loc.gov/MARC21/slim
> http://www.loc.gov/ standards/marcxml/schema/MARC21slim.xsd"
> xmlns="http://www.loc.gov/MARC21/slim">
>
> <leader>00150nx a2200073 4500 </leader>
> <datafield tag="200" ind1=" " ind2="1">
> <subfield code="a">Nicolas</subfield>
> <subfield code="b">Jérôme</subfield>
> <subfield code="4">Traducteur</subfield> </datafield>
> <datafield tag="100" ind1=" " ind2=" ">
> <subfield code="a">19980124afrey50 ba0</subfield>
> </datafield>
> <controlfield tag="001">3568</controlfield> <datafield
> tag="152" ind1=" " ind2=" ">
> <subfield code="b">NP</subfield>
> </datafield>
> </record>
> <?xml version="1.0" encoding="UTF-8"?>
> <record
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.loc.gov/MARC21/slim
> http://www.loc.gov/ standards/marcxml/schema/MARC21slim.xsd"
> xmlns="http://www.loc.gov/MARC21/slim">
>
> <leader>00151nx a2200073 4500 </leader>
> <datafield tag="200" ind1=" " ind2="1">
> <subfield code="a">Guynemer</subfield>
> <subfield code="b">Georges</subfield>
> <subfield code="f">(1894-1917)</subfield> </datafield>
> <datafield tag="100" ind1=" " ind2=" ">
> <subfield code="a">19980129afrey50 ba0</subfield>
> </datafield>
> <controlfield tag="001">4642</controlfield> <datafield
> tag="152" ind1=" " ind2=" ">
> <subfield code="b">NP</subfield>
> </datafield>
> </record>
>
> --
> Henri Damien LAURENT et Paul POULAIN
> Consultants indépendants
> en logiciels libres et bibliothéconomie (http://www.koha-fr.org)
>
>
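
For anyone who wants to check their own subfield data against the control-character table in [3], here is a minimal sketch in Python (not Perl, and not what MARC::File::XML itself does). The names ALLOWED_CONTROLS and disallowed_controls are hypothetical, made up for this illustration. It only looks at characters in Unicode category Cc (control) and does not test the rest of the MARC-21 repertoire.

```python
import unicodedata

# Control characters permitted in MARC 21 UCS/Unicode data, copied from
# the table in note [3] of this message.
ALLOWED_CONTROLS = {
    "\u001b",  # ESCAPE
    "\u001d",  # RECORD TERMINATOR
    "\u001e",  # FIELD TERMINATOR
    "\u001f",  # SUBFIELD DELIMITER
    "\u0098",  # NON-SORT BEGIN
    "\u009c",  # NON-SORT END
    "\u200d",  # JOINER
    "\u200c",  # NON-JOINER
}


def disallowed_controls(text):
    """Return (offset, code point) pairs for control characters not in the table.

    Hypothetical helper: only Unicode category Cc is examined, so ordinary
    letters (accented or not) are never reported.
    """
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
    ]


# A line break inside a subfield value is reported ...
print(disallowed_controls("Nicolas\nJérôme"))        # [(7, 'U+000A')]
# ... while the subfield delimiter from the table is not.
print(disallowed_controls("Guynemer\u001fGeorges"))  # []
```

Running something like this over the subfield text of the two records above would show whether a stray line break, tab or carriage return is what the parser is objecting to.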