Hi Michael,

> An example is the author (personal name) of the book that can
> be found at http://catalog.loc.gov/ by searching for ISBN
> 5040039875 (I'm guessing the fact that the website appears to
> be displaying a corrupted name may be part of the problem here).

The Library of Congress catalog is outputting the MARC data to your
browser in Unicode UTF-8 and it looks correct to me. It may *appear*
corrupted, depending on what font you choose to display the encoding
(try Arial Unicode MS if you are in a Windows environment).

> This name is 'Dontsova, Daria' (approximately),

Below is the UTF-16 encoding of the name in question, based on a
copy-and-paste directly from the browser
(http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?BBID=12550873).

  U+0044 LATIN CAPITAL LETTER D
  U+006F LATIN SMALL LETTER O
  U+006E LATIN SMALL LETTER N
  U+0074 LATIN SMALL LETTER T
  U+FE20 LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
  U+0073 LATIN SMALL LETTER S
  U+FE21 LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
  U+006F LATIN SMALL LETTER O
  U+0076 LATIN SMALL LETTER V
  U+0061 LATIN SMALL LETTER A
  U+002C COMMA
  U+0020 SPACE, BLANK / SPACE
  U+0044 LATIN CAPITAL LETTER D
  U+0061 LATIN SMALL LETTER A
  U+0072 LATIN SMALL LETTER R
  U+02B9 SOFT SIGN, PRIME / MODIFIER LETTER PRIME
  U+0069 LATIN SMALL LETTER I
  U+FE20 LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
  U+0061 LATIN SMALL LETTER A
  U+FE21 LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
  U+002E PERIOD, DECIMAL POINT / FULL STOP

> ... in hex:
> 446f6eeb74ec736f76612c20446172a7eb69ec612e.
> When transcoded by marc8_to_utf8() the result is
> 446f6e74cda173006f76612c20446172cab969cda161002e
> - which contains 2 null (00) characters.

  44 6f 6e [eb] 74 [ec] 73 6f 76 61 2c 20 44 61 72 [a7] [eb] 69 [ec] 61 2e
  44 6f 6e 74 [cd a1] 73 [00] 6f 76 61 2c 20 44 61 72 [ca b9] 69 [cd a1] 61 [00] 2e

Hmmmm. It looks like the MARC-8 'COMBINING LIGATURE LEFT HALF' ("0xEB")
and/or the MARC-8 'COMBINING LIGATURE RIGHT HALF' ("0xEC") got converted
to a Unicode 'COMBINING DOUBLE INVERTED BREVE' ("0xCD 0xA1" in UTF-8 [1]).
That doesn't sound like something that MARC::Charset would do.
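To check which characters the transcoded output really contains, the bytes can be decoded and listed one code point at a time. What follows is only a minimal Python sketch, not MARC::Charset and not Perl; the helper name `describe_utf8_hex` is made up for illustration. It decodes the hex string quoted above as UTF-8 and reports where the NUL characters fall.

```python
import unicodedata

# Output of marc8_to_utf8() exactly as quoted in the original message.
UTF8_HEX = "446f6e74cda173006f76612c20446172cab969cda161002e"


def describe_utf8_hex(hex_string):
    """Decode a hex string as UTF-8; return (index, code point, name) tuples."""
    text = bytes.fromhex(hex_string).decode("utf-8")
    return [
        # Control characters such as U+0000 have no Unicode name,
        # so a placeholder is supplied instead of raising ValueError.
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for i, ch in enumerate(text)
    ]


rows = describe_utf8_hex(UTF8_HEX)
null_positions = [i for i, code_point, _ in rows if code_point == "U+0000"]

for i, code_point, name in rows:
    print(f"{i:2d}  {code_point}  {name}")
print("NUL characters at indexes:", null_positions)
```

In this data, each U+0361 sits right after the letter that followed 0xEB in the MARC-8 input, and each U+0000 sits right after the letter that followed 0xEC. That is an observation about these two byte strings only; it says nothing certain about how the conversion is implemented.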

-- Michael

[1] Unicode Character 'COMBINING DOUBLE INVERTED BREVE' (U+0361)
    http://www.fileformat.info/info/unicode/char/0361/index.htm

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 18, 2007 5:49 AM
> To: perl4lib@perl.org; [EMAIL PROTECTED]
> Subject: MARC::Charset question
>
> Hi,
>
> I'm using marc8_to_utf8() on Library of Congress data. I'm
> finding that I get occasional null characters inserted in the
> output text, and I'm wondering what this means.
>
> An example is the author (personal name) of the book that can
> be found at http://catalog.loc.gov/ by searching for ISBN
> 5040039875 (I'm guessing the fact that the website appears to
> be displaying a corrupted name may be part of the problem here).
>
> This name is 'Dontsova, Daria' (approximately), in hex:
> 446f6eeb74ec736f76612c20446172a7eb69ec612e. When transcoded by
> marc8_to_utf8() the result is
> 446f6e74cda173006f76612c20446172cab969cda161002e - which
> contains 2 null (00) characters.
>
> Is it safe to ignore these null characters (i.e. strip them
> out of the result, which otherwise seems good)?
>
> Thanks,
>
> Michael