Re: Displaying diacritics in a terminal vs. a browser
> A MARC-8 sequence places a combining diacritical mark BEFORE the letter
> it's supposed to combine with, whereas Unicode syntax is to put it AFTER
> the letter it's supposed to combine with.
>
> Hence for example the letter: Ẕ
> is produced by the MARC-8 Sequence:
> 75 5A (macron below + "Z")
> but
> 005A 0331 ("Z" + Combining Macron below) in Unicode.
>
> I believe if you don't account for this in your UTF-8 transformation, you
> will get either no combining or combining with the wrong character.

Just FYI, in case anyone is curious about what MARC::Charset does: to_utf8() will take care of repositioning the diacritics from before to after the character that they modify.

//Ed
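For anyone who wants to see the repositioning in isolation, here is a minimal sketch. It is not the MARC::Charset implementation: the helper name marks_after_base is hypothetical, and it assumes the MARC-8 bytes have already been mapped to Unicode code points but are still in MARC-8 order (mark first, letter second).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MARC::Charset. Takes a character string
# whose combining marks still PRECEDE their base letter (MARC-8 order) and
# returns it with the marks FOLLOWING the base letter (Unicode order).
sub marks_after_base {
    my ($text) = @_;
    # \p{M} matches any combining mark, \P{M} any non-mark character.
    # Capture a run of marks plus the base that follows, then swap them.
    $text =~ s/(\p{M}+)(\P{M})/$2$1/g;
    return $text;
}

my $marc_order    = "\x{0331}Z";    # combining macron below, then "Z"
my $unicode_order = marks_after_base($marc_order);

# Print the resulting code points: U+005A U+0331
print join(' ', map { sprintf 'U+%04X', ord } split //, $unicode_order), "\n";
```

A run of several marks in front of one letter is moved as a unit, so the relative order of the marks is preserved.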
RE: Displaying diacritics in a terminal vs. a browser
Jane,

Thanks very much for the information about Unicode and MARC-8. I still have a lot to learn about the two formats! Since my MARC data is being manipulated primarily in a browser via a CGI script, I'll forgo writing a converter for the terminal display for now, but I eventually plan to do that.

Thanks again!

- Chris

From: Jacobs, Jane W [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 01, 2004 1:51 PM
To: 'Christopher Morgan'
Subject: RE: Displaying diacritics in a terminal vs. a browser

Hi Chris,

I hope my analysis is correct; I think that two problems are going on here:

1) Your terminal display is very likely not up to the "combining" aspect of combining diacriticals.

2) More importantly, there is a technical shift in the placement of diacritical marks between MARC-8 and Unicode:

A MARC-8 sequence places a combining diacritical mark BEFORE the letter it's supposed to combine with, whereas Unicode syntax is to put it AFTER the letter it's supposed to combine with.

Hence for example the letter: Ẕ
is produced by the MARC-8 Sequence:
75 5A (macron below + "Z")
but
005A 0331 ("Z" + Combining Macron below) in Unicode.

I believe if you don't account for this in your UTF-8 transformation, you will get either no combining or combining with the wrong character.

Hope that's useful.

JJ

**Views expressed by the author do not necessarily represent those of the Queens Library.**

Jane Jacobs
Asst. Coord., Catalog Division
Queens Borough Public Library
89-11 Merrick Blvd.
Jamaica, NY 11432
tel.: (718) 990-0804
e-mail: [EMAIL PROTECTED]
FAX. (718) 990-8566

-----Original Message-----
From: Christopher Morgan [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 01, 2004 10:50 AM
To: [EMAIL PROTECTED]
Subject: Displaying diacritics in a terminal vs. a browser

Hi all,

I use the $cs->to_utf8 conversion from MARC::Charset to display MARC Authority records in a browser, and the diacritics display properly there. But they don't display properly via STDOUT in my terminal window (I get two characters instead of one -- one with the letter and one with the accent mark). Am I doing something wrong? I'm using:

binmode (STDOUT, ":utf8");

Is there any way around this problem, or is it a limitation of terminal displays?

(I found a thread in the archives: http://www.mail-archive.com/[EMAIL PROTECTED]/msg00280.html that discusses a similar issue, but it didn't really answer my question).

Thanks!

-- Chris Morgan
Re: Displaying diacritics in a terminal vs. a browser
On Thu, Jul 01, 2004 at 11:22:42AM -0400, Houghton,Andrew wrote:
> I'm not sure what MARC::Charset does internally, but MARC-8
> defines the diacritic separate from the base character. So
> even using binmode(STDOUT,":utf8") will produce two characters,
> one for the base character followed by the diacritic. If you
> want them combined then you need to combine them.

As you suggest, Andy, MARC::Charset simply translates MARC-8 combining characters into UTF-8 combining characters.

> It just so happens that I have recently been converting MARC-XML
> to RDF. The RDF specification mandates Unicode Normal form C,
> which means that the base character and the diacritic are
> combined. MARC-XML uses Unicode Normal form D, which means that
> the base character is separate from the diacritic. So I hacked
> together some Perl scripts to convert Unicode NFD <-> Unicode NFC.
> The scripts require Perl 5.8.0.

Wow, I've always been under the impression that the character sets operated the same in RDF as they do in XML proper with the 'encoding' attribute:

> I was talking with a colleague, just yesterday, about whether we
> should unleash these on the Net... They need to be cleaned up a
> little and need some basic documentation on how to run the Perl
> scripts.

It would be nice to have them wrapped up with a module interface for use in non-command-line apps. I'd be open to integrating this functionality into MARC::Charset if you are interested.

//Ed
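Andy's scripts were not posted in this thread, so the following is not his code. It is only a minimal sketch of the NFD <-> NFC conversion he describes, using the Unicode::Normalize module that ships with Perl 5.8. The sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points (NFD).
my $decomposed = "e\x{0301}";

# NFC composes the pair into the single code point U+00E9.
my $composed = NFC($decomposed);

# NFD splits it back into base letter plus combining mark.
my $round_trip = NFD($composed);

printf "NFD length: %d\n", length $decomposed;    # 2
printf "NFC length: %d\n", length $composed;      # 1
printf "round trip matches: %s\n", $round_trip eq $decomposed ? 'yes' : 'no';
```

Both functions take and return Perl character strings, so they can be applied to the output of a conversion step before the string is encoded for output.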
RE: Displaying diacritics in a terminal vs. a browser
> From: Paul Hoffman [mailto:[EMAIL PROTECTED]
> Sent: 01 July, 2004 11:57
> Subject: Re: Displaying diacritics in a terminal vs. a browser
>
> Unless I'm very much mistaken, Chris's code is outputting
> UTF-8 to the terminal, not MARC-8.
>
> >> From: Christopher Morgan [mailto:[EMAIL PROTECTED]
> >> Sent: 01 July, 2004 10:50
> >> Subject: Displaying diacritics in a terminal vs. a browser
> >>
> >> (I get two characters instead of one -- one with the letter
> >> and one with the accent mark). Am I doing something wrong?

I realized that he was outputting UTF-8, but if he started with MARC-8 and used $cs->to_utf8 in MARC::Charset, MARC::Charset would most likely keep the data in Unicode Normal form D, which is why he sees two characters. When he views them with a browser, the browser most likely receives the two characters but, depending upon what fonts you are using, it will combine the two characters to look as *if* they are one combined character.

>
> http://mail.nl.linux.org/linux-utf8/2003-07/msg00231.html
>

Nice reference...

Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm
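One way to check whether a string really holds two code points (base letter plus combining mark) rather than one precomposed character is to dump its code points. This is a rough diagnostic sketch: the helper name describe_chars is hypothetical, and the sample string stands in for whatever a conversion step returned.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: returns one "U+XXXX NAME" entry per code point.
sub describe_chars {
    my ($text) = @_;
    return map {
        sprintf 'U+%04X %s', ord($_), charnames::viacode(ord $_) || '(unnamed)'
    } split //, $text;
}

# Stand-in for converted data: "Z" followed by a combining macron below.
my $sample = "Z\x{0331}";

print "$_\n" for describe_chars($sample);
# U+005A LATIN CAPITAL LETTER Z
# U+0331 COMBINING MACRON BELOW
```

Two lines of output for what should look like one letter means the data is decomposed, and whether it renders as a single glyph is then up to the terminal or browser.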
Re: Displaying diacritics in a terminal vs. a browser
Unless I'm very much mistaken, Chris's code is outputting UTF-8 to the terminal, not MARC-8.

The key is to find a terminal program that correctly displays UTF-8. I doubt you'll have any trouble finding one -- for example, there are at least two for Mac OS X alone (Terminal.app and iTerm). Depending on your platform, freshmeat.net or tucows.com may be the place to go.

This thread from the linux-utf8 list may also be helpful (I googled for 'terminal UTF-8'):

http://mail.nl.linux.org/linux-utf8/2003-07/msg00231.html

Paul.

On Thursday, July 1, 2004, at 11:22 AM, Houghton,Andrew wrote:

>> From: Christopher Morgan [mailto:[EMAIL PROTECTED]
>> Sent: 01 July, 2004 10:50
>> Subject: Displaying diacritics in a terminal vs. a browser
>>
>> I use the $cs->to_utf8 conversion from MARC::Charset to display MARC Authority records in a browser, and the diacritics display properly there. But they don't display properly via STDOUT in my terminal window (I get two characters instead of one -- one with the letter and one with the accent mark). Am I doing something wrong? I'm using:
>>
>> binmode (STDOUT, ":utf8");
>>
>> Is there any way around this problem, or is it a limitation of terminal displays?
>
> I'm not sure what MARC::Charset does internally, but MARC-8 defines the diacritic separate from the base character. So even using binmode(STDOUT,":utf8") will produce two characters, one for the base character followed by the diacritic. If you want them combined then you need to combine them.
>
> It just so happens that I have recently been converting MARC-XML to RDF. The RDF specification mandates Unicode Normal form C, which means that the base character and the diacritic are combined. MARC-XML uses Unicode Normal form D, which means that the base character is separate from the diacritic. So I hacked together some Perl scripts to convert Unicode NFD <-> Unicode NFC. The scripts require Perl 5.8.0.
>
> I was talking with a colleague, just yesterday, about whether we should unleash these on the Net... They need to be cleaned up a little and need some basic documentation on how to run the Perl scripts.
>
> Andy.
>
> Andrew Houghton, OCLC Online Computer Library Center, Inc.
> http://www.oclc.org/about/
> http://www.oclc.org/research/staff/houghton.htm

--
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/
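As a rough first check on the terminal side, a script can look at whether the locale settings advertise UTF-8 at all. This is only a heuristic sketch with a hypothetical helper name (locale_is_utf8). It says nothing about whether the terminal renders combining characters correctly, only whether it claims to expect UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the locale settings in the given
# environment hash advertise a UTF-8 character set, 0 otherwise.
sub locale_is_utf8 {
    my (%env) = @_;
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$var};
        next unless defined $value && length $value;
        # The first variable that is set decides the answer.
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

print locale_is_utf8(%ENV)
    ? "locale advertises UTF-8\n"
    : "locale does not advertise UTF-8\n";
```

If the locale does not advertise UTF-8, multi-byte output is likely to be misdisplayed regardless of how the diacritics are ordered or normalized.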
RE: Displaying diacritics in a terminal vs. a browser
Andy,

Many thanks. I'd be interested in looking at your scripts if you do post them!

-- Chris

-----Original Message-----
From: Houghton,Andrew [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 01, 2004 10:23 AM
To: [EMAIL PROTECTED]
Subject: RE: Displaying diacritics in a terminal vs. a browser

> From: Christopher Morgan [mailto:[EMAIL PROTECTED]
> Sent: 01 July, 2004 10:50
> Subject: Displaying diacritics in a terminal vs. a browser
>
> I use the $cs->to_utf8 conversion from MARC::Charset to display MARC
> Authority records in a browser, and the diacritics display properly
> there.
> But they don't display properly via STDOUT in my terminal window (I
> get two characters instead of one -- one with the letter and one with
> the accent mark). Am I doing something wrong? I'm using:
>
> binmode (STDOUT, ":utf8");
>
> Is there any way around this problem, or is it a limitation of
> terminal displays?

I'm not sure what MARC::Charset does internally, but MARC-8 defines the diacritic separate from the base character. So even using binmode(STDOUT,":utf8") will produce two characters, one for the base character followed by the diacritic. If you want them combined then you need to combine them.

It just so happens that I have recently been converting MARC-XML to RDF. The RDF specification mandates Unicode Normal form C, which means that the base character and the diacritic are combined. MARC-XML uses Unicode Normal form D, which means that the base character is separate from the diacritic. So I hacked together some Perl scripts to convert Unicode NFD <-> Unicode NFC. The scripts require Perl 5.8.0.

I was talking with a colleague, just yesterday, about whether we should unleash these on the Net... They need to be cleaned up a little and need some basic documentation on how to run the Perl scripts.

Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm
RE: Displaying diacritics in a terminal vs. a browser
> From: Christopher Morgan [mailto:[EMAIL PROTECTED]
> Sent: 01 July, 2004 10:50
> Subject: Displaying diacritics in a terminal vs. a browser
>
> I use the $cs->to_utf8 conversion from MARC::Charset to
> display MARC Authority records in a browser, and the
> diacritics display properly there.
> But they don't display properly via STDOUT in my terminal
> window (I get two characters instead of one -- one with the
> letter and one with the accent mark). Am I doing something
> wrong? I'm using:
>
> binmode (STDOUT, ":utf8");
>
> Is there any way around this problem, or is it a limitation
> of terminal displays?

I'm not sure what MARC::Charset does internally, but MARC-8 defines the diacritic separate from the base character. So even using binmode(STDOUT,":utf8") will produce two characters, one for the base character followed by the diacritic. If you want them combined then you need to combine them.

It just so happens that I have recently been converting MARC-XML to RDF. The RDF specification mandates Unicode Normal form C, which means that the base character and the diacritic are combined. MARC-XML uses Unicode Normal form D, which means that the base character is separate from the diacritic. So I hacked together some Perl scripts to convert Unicode NFD <-> Unicode NFC. The scripts require Perl 5.8.0.

I was talking with a colleague, just yesterday, about whether we should unleash these on the Net... They need to be cleaned up a little and need some basic documentation on how to run the Perl scripts.

Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm
Displaying diacritics in a terminal vs. a browser
Hi all,

I use the $cs->to_utf8 conversion from MARC::Charset to display MARC Authority records in a browser, and the diacritics display properly there. But they don't display properly via STDOUT in my terminal window (I get two characters instead of one -- one with the letter and one with the accent mark). Am I doing something wrong? I'm using:

binmode (STDOUT, ":utf8");

Is there any way around this problem, or is it a limitation of terminal displays?

(I found a thread in the archives: http://www.mail-archive.com/[EMAIL PROTECTED]/msg00280.html that discusses a similar issue, but it didn't really answer my question).

Thanks!

-- Chris Morgan