Re: Character identities
At 11:37 25.10.2002 -0700, Doug Ewell wrote:

> Marc Wilhelm Küster kuester at saphor dot net wrote:
>
>> As to the long s, it is not used for writing present-day German except in rare cases, notably in some scholarly editions and in the Fraktur script. Very few texts beyond the names of newspapers are nowadays produced in Fraktur. To put the long s on the German keyboard would be quite contrary to user requirements -- and if a requirement existed, it would be DIN's job to amend DIN 2137-2 and the upcoming DIN 2137-12 to cater for it.
>
> Irrelevant, sure, but contrary? I don't see what harm could come from adding a character to a previously unassigned key, especially in the relatively obscure AltGr zone (Level 3). Most users could safely ignore it, and most would never even know it was there.

In principle, you are right. Unfortunately, there's quite a bit of software around that (mis-)uses unassigned AltGr keys for its own purposes -- this includes, on Windows NT ff. at least, software such as the localized MS Word. So adding new assignments potentially clashes with existing software and should only be done if there is a sufficiently high public interest in doing so.

But yes, of course it would be DIN's job to standardize such a thing (or not).

> Patrick Andries asked if a revised German keyboard standard would be ignored in the market with the same cavalier attitude seen in Canada (and the U.S.). My impression is that European manufacturers are held more closely to conformance with national and international standards than North American manufacturers, but I'd want some Europeans to back me up on this.

Speaking of Europe, it differs from country to country. In Germany, DIN 2137 is certainly widely adhered to, and changes to it would in all likelihood be taken up fast on the market.

Best regards,
Marc Küster

> -Doug Ewell
> Fullerton, California

*
Marc Wilhelm Küster
Saphor GmbH
Fronländer 22
D-72072 Tübingen
Tel.: (+49) / (0)7472 / 949 100
Fax: (+49) / (0)7472 / 949 114
BabelPad
BabelPad, my free Unicode plain text editor for Windows, has now been released. Further information is available at http://uk.geocities.com/BabelStone1357/Software/BabelPad.html.

BabelPad also includes input methods for a number of scripts which I am interested in, currently:

  Tibetan (using Extended Wylie)
  Yi (using standard romanisation)

Note that although a build of BabelPad for Windows 95/98/ME is available, it is not as feature-complete as the builds for Windows NT 4.0 or Windows 2000/XP, and may not work properly when configured to use different fonts for different Unicode ranges.

I haven't written the help system yet, but a FAQ is available at http://uk.geocities.com/BabelStone1357/Software/BabelPad.html.

My thanks to those members of this list who have commented on the pre-release versions of BabelPad.

Regards,
Andrew
RE: Character identities
>> ... For this reason it is quite impermissible to render the combining letter small e as a diaeresis
>
> So far so good. There would be no reason for doing such a thing.
>
>> ... or, for that matter, the diaeresis as a combining letter small e (however, you see the latter version sometimes, very infrequently, in advertisement).
>
> This is the case I thought we were discussing, and it is a very different case.

No, the claim was that diaeresis and overscript e are the same, so the reversed case Marc is talking about is not different at all.

> Standing Keld's opinion and Marc's wholehearted support, it

Please don't confuse me with Keld!

> follows that those infrequent advertisements should be encoded using U+0364... But U+0364 (COMBINING LATIN SMALL LETTER E) belongs to a small collection of Medieval superscript letter diacritics, which is supposed to appear primarily in medieval Germanic manuscripts, or to reproduce some usage as late as the 19th century in some languages.

Yes, but you should not read too much into the explanation, which, while correct, does not limit the existence of their glyphs to fonts used only by Germanic professors... Some of them (overscript e in particular) should be(come) quite commonly occurring in any Fraktur Unicode font.

> Using such a character to encode 21st century advertisements is doomed to cause problems:
>
> 1) The glyph for U+0364 is more likely found in the font collection of the Faculty of Germanic Studies than on the PC of people wishing to read the advertisement for Ye Olde Küster Pub. So, most people will be unable to view the advertisement correctly.
>
> 2) The designer of the advertisement will be unable to use his spell-checker and hyphenator on the advertisement's text.

Advertisements should invariably be final spell-checked and hyphenated by humans! Automated spell checkers and hyphenators for German (as well as Scandinavian languages) have (so far) not been good enough even for running text that you want to publish...

> 3) Users will be unable to find the Küster Pub by searching Küster in a search engine.

Depends on the search engine, and if it uses a correct collation table (for the language) or not...

> What will actually happen is that everybody will see an empty square, so they'll think that the web designer is an idiot, apart from the professors at the Faculty of Germanic Studies, who'll think that the designer is an idiot because she doesn't know the difference between U+0308 and U+0364 in ancient German.

Most modern use of Fraktur seems to use diaeresis or double acute for this.

(But the web designer could use a dynamically downloaded font fragment, if there is worry that all glyphs might not be supported by the fonts used by the vast majority of the target audience.)

> The real error (IMHO) is the idea that font designers should stick to the *sample* glyphs printed in the Unicode book, because this would force

Well, the diacritics are allocated/unified on glyphic grounds. While a diaeresis may look different from font to font, it is basically two dots (of some shape in line with the design of the font), never an e shape. At least not in the *default mode* of a *Unicode font*. And overscript small e will also vary with the font, looking like a shrunken ordinary e glyph of (ideally) the same font. But never like two dots (in the default mode of a Unicode font).

> graphic designers to change the *encoding* of their text in order to get the desired result.

A graphic designer is likely to turn the whole thing into 2-d or 3-d graphics, probably distorted, possibly animated, to get the desired result! At which point the original, or intermediary, encoding of any text elements is not very relevant to the end result.

> Another big error (IMHO, once again) is the idea that two different Unicode characters should look different.

I have never said that! E.g., a µ as well as an Å (both of which are allocated twice!) should look the same (resp.) regardless of which of their respective code points is used. There are many more examples of characters that definitely should (e.g. capital K and Kelvin sign, small i and small roman numeral one) or may (capital A, capital Alpha, ...) look the same.

There are also lots of characters that mean the same, but always (in a Unicode font in default mode) should/must look different. Like M and Roman Numeral One Thousand C D (just to take an example closer to Italy... ;-).

> The difference must be preserved when it is useful -- e.g., U+0308 should not look like U+0364 in a

"should not" -- must never

> font designed for publishing books on the history of German!

"a font ..." -- any Unicode font in default mode (Bad example, Marco!)

> What should really happen, IMHO, is that modern German should be encoded as modern German. A U+0308 (COMBINING DIAERESIS) should remain a U+0308, regardless that the corresponding glyph *looks* like U+0364 (COMBINING LATIN SMALL LETTER E) in one font, and it looks
Re: Character identities
On Mon, Oct 28, 2002 at 11:21:30AM +0100, Kent Karlsson wrote:

> No, the claim was that diaeresis and overscript e are the same, so the reversed case Marc is talking about is not different at all.

The claim is that, for certain fonts, it is appropriate to image the a-umlaut character as an a^e. That doesn't imply anything about the other way around, or else t' could legally be displayed as a t with caron above.

>> A U+0308 (COMBINING DIAERESIS) should remain a U+0308, regardless that the corresponding glyph *looks* like U+0364 (COMBINING LATIN SMALL LETTER E) in one font, and it looks like U+0304 (COMBINING MACRON) in another font, and it looks like two five-pointed stars side-by-side in a third font, and it looks like Mickey Mouse's ears in Disney.ttf...
>
> These are all unacceptable variations in a *Unicode font (in default mode)*. But you can have all kinds of silly variations in *non*-Unicode fonts applied to Unicode text, including ciphers or rebuses... (ok, there are degrees...)

Basically, any decorative or handwriting font can't be a Unicode font. (The glyph for my German teacher's umlaut was definitely a macron.) Seems pointless to tell a lot of the fontmakers out there that they shouldn't worry about Unicode, because Unicode's only for standard book fonts, but that's the only way I can read your last statement.

--
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie.
  -- Stephen Crane, War is Kind
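The encoding side of this argument (as opposed to the glyph side) can be checked mechanically. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper name `nfc` and the sample strings are illustrative assumptions, not anything posted in the thread. It shows that u + U+0308 is canonically equivalent to the precomposed ü, while u + U+0364 is a different character sequence that normalization leaves alone, which is why a search for "Küster" would be unlikely to find the U+0364 spelling unless the search engine applies extra folding.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC (canonical composition) form of text."""
    return unicodedata.normalize("NFC", text)


# Three spellings of the same name (illustrative sample data).
precomposed = "K\u00fcster"      # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
with_diaeresis = "Ku\u0308ster"  # u + U+0308 COMBINING DIAERESIS
with_small_e = "Ku\u0364ster"    # u + U+0364 COMBINING LATIN SMALL LETTER E

# u + U+0308 composes to the precomposed letter under NFC.
print(nfc(with_diaeresis) == precomposed)

# u + U+0364 has no canonical equivalence with u-diaeresis.
print(nfc(with_small_e) == precomposed)

# The two combining marks are distinct encoded characters.
print(unicodedata.name("\u0308"))
print(unicodedata.name("\u0364"))
```

Whatever a font chooses to draw for U+0308, the comparison above depends only on the code points, which is the sense in which the encoding is independent of the glyph.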
Copyright on gif images via http://www.unicode.org/cgi-bin/GetUnihanData.pl
I have asked this question before without an answer, so I am asking again. The Unihan Database browser at http://www.unicode.org/cgi-bin/GetUnihanData.pl shows an example glyph via http://www.unicode.org/cgi-bin/refglyph?24-codepoint. I would like to use this image, but where can I ask for permission? I have written a CGI which renders a banner using the URI above, but I am not sure if I can cache the image. The CGI is still on my intranet, but I can disclose it upon request. Dan Kogai or http://www.unicode.org/cgi-bin/refglyph?24-5f3e
RE: Character identities
Kent Karlsson wrote:

>>> For this reason it is quite impermissible to render the combining letter small e as a diaeresis
>>
>> So far so good. There would be no reason for doing such a thing.
>>
>>> ... or, for that matter, the diaeresis as a combining letter small e (however, you see the latter version sometimes, very infrequently, in advertisement).
>>
>> This is the case I thought we were discussing, and it is a very different case.
>
> No, the claim was that diaeresis and overscript e are the same,

The claim was that dieresis and overscript e are the same in *modern* *standard* German. Or, better stated, that overscript e is just a glyph variant of dieresis, in *modern* *standard* German typeset in Fraktur.

Sorry if I haven't stated this clearly enough.

> so the reversed case Marc is talking about is not different at all.

It is. In the first case, we are talking about a glyph variant in *modern* *standard* German; in the second case, we are talking about two different diacritics in some *other* context. (Ancient German? Ancient Swedish?)

>> Standing Keld's opinion and Marc's wholehearted support, it
>
> Please don't confuse me with Keld!

Oooops! My apologies!

>> follows that those infrequent advertisements should be encoded using U+0364... But U+0364 (COMBINING LATIN SMALL LETTER E) belongs to a small collection of Medieval superscript letter diacritics, which is supposed to appear primarily in medieval Germanic manuscripts, or to reproduce some usage as late as the 19th century in some languages.
>
> Yes, but you should not read too much into the explanation, which, while correct, does not limit the existence of their glyphs to fonts used only by Germanic professors... Some of them (overscript e in particular) should be(come) quite commonly occurring in any Fraktur Unicode font.

"Commonly" sounds funny near "Fraktur"...

>> Using such a character to encode 21st century advertisements is doomed to cause problems:
>>
>> 1) The glyph for U+0364 is more likely found in the font collection of the Faculty of Germanic Studies than on the PC of people wishing to read the advertisement for Ye Olde Küster Pub. So, most people will be unable to view the advertisement correctly.
>>
>> 2) The designer of the advertisement will be unable to use his spell-checker and hyphenator on the advertisement's text.
>
> Advertisements should invariably be final spell-checked and hyphenated by humans! Automated spell checkers and hyphenators for German (as well as Scandinavian languages) have (so far) not been good enough even for running text that you want to publish...

This has no connection with this discussion. However, IMHO, the presence of U+0364 (COMBINING LATIN SMALL LETTER E) in a modern German or Swedish text is just a plain spelling error, and even the most naive spellchecker should flag it as such.

>> 3) Users will be unable to find the Küster Pub by searching Küster in a search engine.
>
> Depends on the search engine, and if it uses a correct collation table (for the language) or not...
>
>> What will actually happen is that everybody will see an empty square, so they'll think that the web designer is an idiot, apart from the professors at the Faculty of Germanic Studies, who'll think that the designer is an idiot because she doesn't know the difference between U+0308 and U+0364 in ancient German.
>
> Most modern use of Fraktur seems to use diaeresis or double acute for this.

U+0308 (COMBINING DIAERESIS) should be the only umlaut to be found in modern German text. What that diacritic *looks* like (two dots, an e, a double acute, a macron, Mickey Mouse's ears) is a choice of the font designer.

> (But the web designer could use a dynamically downloaded font fragment, if there is worry that all glyphs might not be supported by the fonts used by the vast majority of the target audience.)

This too has no connection with this discussion, and is OT. Unicode is concerned with how text is *encoded*; the details of fonts and display technology are out of scope. What Unicode really mandates is that the encoding should not change to obtain a certain graphic effect.

>> The real error (IMHO) is the idea that font designers should stick to the *sample* glyphs printed in the Unicode book, because this would force
>
> Well, the diacritics are allocated/unified on glyphic grounds. While a diaeresis may look different from font to font, it is basically two dots (of some shape in line with the design of the font), never an e shape. At least not in the *default mode* of a *Unicode font*. And overscript small e will also vary with the font, looking like a shrunken ordinary e glyph of (ideally) the same font. But never like two dots (in the default mode of a Unicode font).

You haven't yet defined your meaning of "Unicode font" and, now, you add a new fancy term: "default mode"! What's a default mode? Unicode does not require fonts to have any kind of modes. You seem to be
Re: Character identities
Marco Cimarosti marco dot cimarosti at essetre dot it wrote:

>> There are also lots of characters that mean the same, but always (in a Unicode font in default mode) should/must look different. Like M and Roman Numeral One Thousand C D (just to take an example closer to Italy... ;-).
>
> Well, the first and only time I have seen that Thousand C D was on the Unicode charts... However, if I'd be asked which glyph is more appropriate for that character, I would say: the same as capital M.

I would disagree with this. It seems to me the whole reason for both U+216F ROMAN NUMERAL ONE THOUSAND and U+2180 ROMAN NUMERAL ONE THOUSAND C D to exist is that they should have different glyphs. This is not necessarily in keeping with the purest spirit of Unicode (which might regard these as two glyphs of a single character), but in reality they are encoded as two characters.

Note, however, that there is nothing wrong with using the same glyph for U+004D and U+216F, although in many fonts they are different for no obvious reason.

-Doug Ewell
Fullerton, California
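The relationship among these code points is recorded in the Unicode Character Database, and can be inspected with Python's standard `unicodedata` module. This is a minimal sketch, not part of the original exchange; the helper names `nfc` and `nfkc` and the constant names are illustrative assumptions. It shows that U+216F carries a compatibility mapping to plain M, whereas U+2180 has no decomposition at all, and that the Kelvin and Angstrom signs mentioned earlier in the thread are canonical duplicates of K and Å.

```python
import unicodedata


def nfc(text: str) -> str:
    """Canonical composition."""
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    """Compatibility composition (also folds compatibility duplicates)."""
    return unicodedata.normalize("NFKC", text)


ROMAN_M = "\u216f"     # ROMAN NUMERAL ONE THOUSAND
ROMAN_CD = "\u2180"    # ROMAN NUMERAL ONE THOUSAND C D
KELVIN = "\u212a"      # KELVIN SIGN
ANGSTROM = "\u212b"    # ANGSTROM SIGN
MICRO = "\u00b5"       # MICRO SIGN

# U+216F folds to the ordinary capital M under compatibility normalization.
print(nfkc(ROMAN_M) == "M")

# U+2180 has no decomposition, so normalization leaves it untouched.
print(unicodedata.decomposition(ROMAN_CD) == "")
print(nfkc(ROMAN_CD) == ROMAN_CD)

# The "allocated twice" characters: canonical (K, A-ring) and compatibility (micro).
print(nfc(KELVIN) == "K")
print(nfc(ANGSTROM) == "\u00c5")
print(nfkc(MICRO) == "\u03bc")
```

None of this settles what a glyph should look like; it only shows which of these characters the standard itself treats as equivalent to another one.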
Re: Character identities
On 2002.10.28, 13:09, David Starner [EMAIL PROTECTED] wrote:

> Basically, any decorative or handwriting font can't be a Unicode font.
> ...
> Seems pointless to tell a lot of the fontmakers out there that they shouldn't worry about Unicode, because Unicode's only for standard book fonts

Hm, what if I want to make, say, snow capped Devanagari glyphs for my hiking company in Nepal? Shouldn't I assign them to Unicode code points?

--
António MARTINS-Tuválkin
[EMAIL PROTECTED]
R. Laureano de Oliveira, 64 r/c esq.
PT-1885-050 MOSCAVIDE (LRS)
+351 917 511 549
http://www.tuvalkin.web.pt/bandeira/
http://pagina.de/bandeiras/

Não me invejo de quem tem
carros, parelhas e montes
só me invejo de quem bebe
a água em todas as fontes
Re: Character identities
On 2002.10.28, 13:09, David Starner [EMAIL PROTECTED] wrote:

> Basically, any decorative or handwriting font can't be a Unicode font.
> ...
> Seems pointless to tell a lot of the fontmakers out there that they shouldn't worry about Unicode, because Unicode's only for standard book fonts

Hello? Who says decorative or handwriting fonts can't be Unicode fonts? I've got dozens of fonts on my system that prove this wrong. Zapfino, which ships with OS X and which I had the privilege to work on, is about as decorative a handwriting font as you could wish for, and of course it has a Unicode cmap. Or are you working with some definition of 'Unicode font' other than 'font with a Unicode cmap'?

John Hudson
Tiro Typeworks
www.tiro.com
Vancouver, BC
[EMAIL PROTECTED]

It is necessary that by all means and cunning, the cursed owners of books should be persuaded to make them available to us, either by argument or by force.
  - Michael Apostolis, 1467
Re: Character identities
At 20:59 + 2002-10-28, António Martins-Tuválkin wrote:

> On 2002.10.28, 13:09, David Starner [EMAIL PROTECTED] wrote:
>
>> Basically, any decorative or handwriting font can't be a Unicode font.
>> ...
>> Seems pointless to tell a lot of the fontmakers out there that they shouldn't worry about Unicode, because Unicode's only for standard book fonts
>
> Hm, what if I want to make, say, snow capped Devanagari glyphs for my hiking company in Nepal? Shouldn't I assign them to Unicode code points?

That's what Private Use code positions are for.

--
Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Character identities
> At 20:59 + 2002-10-28, António Martins-Tuválkin wrote:
>
>> On 2002.10.28, 13:09, David Starner [EMAIL PROTECTED] wrote:
>>
>>> Basically, any decorative or handwriting font can't be a Unicode font.
>>> ...
>>> Seems pointless to tell a lot of the fontmakers out there that they shouldn't worry about Unicode, because Unicode's only for standard book fonts
>>
>> Hm, what if I want to make, say, snow capped Devanagari glyphs for my hiking company in Nepal? Shouldn't I assign them to Unicode code points?
>
> That's what Private Use code positions are for.
>
> --
> Michael Everson * * Everson Typography * * http://www.evertype.com

I don't think so. He seems to be talking about a specific typographic style. Code points don't care about style, whether it's Franklin Gothic or Snowcapped Helvetica.

Don
Re: Character identities
At 13:36 -0700 2002-10-28, John Hudson wrote:

> Or are you working with some definition of 'Unicode font' other than 'font with a Unicode cmap'?

It seemed to me that he was talking about fonts that had characters that weren't in Unicode at all. I don't mean precomposed vowels, but, say, fonts with moon phases in them.

--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Character identities
>> Hm, what if I want to make, say, snow capped Devanagari glyphs for my hiking company in Nepal? Shouldn't I assign them to Unicode code points?
>
> That's what Private Use code positions are for.
>
> --
> Michael Everson * * Everson Typography * * http://www.evertype.com

Um, Michael, I think António was talking about glyphs in a decorative font, which should -- clearly -- just be mapped to ordinary Unicode characters, via an ordinary Unicode cmap.

Or do you think that the yellow, cursive, shadow-dropped, 3-D letters "Getaway!" at:

http://www.trekking-in-nepal.com/

should also be represented by Private Use code positions? ;-)

--Ken
Re: Character identities
On Mon, Oct 28, 2002 at 09:36:34PM +, Michael Everson wrote:

> At 20:59 + 2002-10-28, António Martins-Tuválkin wrote:
>
>> On 2002.10.28, 13:09, David Starner [EMAIL PROTECTED] wrote:
>>
>>> Basically, any decorative or handwriting font can't be a Unicode font.
>>> ...
>>> Seems pointless to tell a lot of the fontmakers out there that they shouldn't worry about Unicode, because Unicode's only for standard book fonts
>>
>> Hm, what if I want to make, say, snow capped Devanagari glyphs for my hiking company in Nepal? Shouldn't I assign them to Unicode code points?
>
> That's what Private Use code positions are for.

But think of the utility if Unicode added a COMBINING SNOWCAP and COMBINING FIRECAP! But should we combine the SNOWCAP with the ICECAP? (-:

--
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie.
  -- Stephen Crane, War is Kind
Re: Character identities
On Mon, Oct 28, 2002 at 01:36:08PM -0700, John Hudson wrote:

> On 2002.10.28, 13:09, David Starner [EMAIL PROTECTED] wrote:
>
>> Basically, any decorative or handwriting font can't be a Unicode font.
>> ...
>> Seems pointless to tell a lot of the fontmakers out there that they shouldn't worry about Unicode, because Unicode's only for standard book fonts
>
> Hello? Who says decorative or handwriting fonts can't be Unicode fonts?
> [...]
> Or are you working with some definition of 'Unicode font' other than 'font with a Unicode cmap'?

Right above where it was cut it said:

  Marco: A U+0308 (COMBINING DIAERESIS) should remain a U+0308, regardless that the corresponding glyph *looks* like U+0364 (COMBINING LATIN SMALL LETTER E) in one font, and it looks like U+0304 (COMBINING MACRON) in another font, and it looks like two five-pointed stars side-by-side in a third font, and it looks like Mickey Mouse's ears in Disney.ttf...

  Kent: These are all unacceptable variations in a *Unicode font (in default mode)*.

Earlier:

  Marco: there are fonts which don't have dots over i and j;

  Kent: You have a slight point there, but those are not intended for running text. And I'm hesitant to label them Unicode fonts.

Given that definition of Unicode fonts, a number of decorative or handwriting fonts (though fewer than I expected) are arbitrarily excluded from being Unicode fonts.

--
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie.
  -- Stephen Crane, War is Kind
Re: Character identities
At 14:30 -0800 2002-10-28, Kenneth Whistler wrote:

>>> Hm, what if I want to make, say, snow capped Devanagari glyphs for my hiking company in Nepal? Shouldn't I assign them to Unicode code points?
>>
>> That's what Private Use code positions are for.
>>
>> --
>> Michael Everson * * Everson Typography * * http://www.evertype.com
>
> Um, Michael, I think António was talking about glyphs in a decorative font, which should -- clearly -- just be mapped to ordinary Unicode characters, via an ordinary Unicode cmap.

If they correspond to Unicode characters, yes, certainly.

> Or do you think that the yellow, cursive, shadow-dropped, 3-D letters "Getaway!" at:
>
> http://www.trekking-in-nepal.com/
>
> should also be represented by Private Use code positions? ;-)

Not at all. Fonts with images of igloos and yurts would use it, though, I would think.

--
Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Character identities
At 14:31 -0800 2002-10-28, Figge, Donald wrote:

>> At 20:59 + 2002-10-28, António Martins-Tuválkin wrote:
>>
>>> On 2002.10.28, 13:09, David Starner [EMAIL PROTECTED] wrote:
>>>
>>>> Basically, any decorative or handwriting font can't be a Unicode font.
>>>> ...
>>>> Seems pointless to tell a lot of the fontmakers out there that they shouldn't worry about Unicode, because Unicode's only for standard book fonts
>>>
>>> Hm, what if I want to make, say, snow capped Devanagari glyphs for my hiking company in Nepal? Shouldn't I assign them to Unicode code points?
>>
>> That's what Private Use code positions are for.
>>
>> --
>> Michael Everson * * Everson Typography * * http://www.evertype.com
>
> I don't think so. He seems to be talking about a specific typographic style. Code points don't care about style, whether it's Franklin Gothic or Snowcapped Helvetica.

I must have misunderstood. I think I only saw the snow-capped and not the Devanagari. Sorry.

--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Character identities
My USD 0.02, as someone who is neither a professional typographer nor a font designer (more than one, but not quite two, different things)...

Discussions about the character-glyph model often mention the essential characteristics of a given character. For example, a Latin capital A can be bold, italic, script, sans-serif, etc., but it must always have that essential A-ness such that readers of (e.g.) English can identify it as an A instead of, say, an O or a 4 or a picture of a duck. (Mark Davis has a chart showing dozens of different A's in his Unicode Myths presentation.)

Somewhere in between the obvious relationships (A = A, B ≠ A), we have the case pair A and a. They are not identical, but they are certainly more similar to each other than are A and B.

It seems to me, as a non-font guy, that calling a font a Unicode font implies two things:

1. It must be based on Unicode code points. For True- and OpenType fonts, this implies a Unicode cmap; for other font technologies it implies some more-or-less equivalent mechanism. The point is that glyphs must be associated with Unicode code points (not necessarily 1-to-1, of course), not merely with an internal 8-bit table that can be mapped to Unicode only through some other piece of software.

2. The glyphs must reflect the essential characteristics of the Unicode character to which they are mapped. That means a capital A can be bold, italic, script, sans-serif, etc. A small a can also be small-caps (or even full-size caps), but I think this is the only controversial point.

In a Unicode font, U+0041 cannot be mapped to a capital A with macron, as it is in Bookshelf Symbol 1; nor to a six-pointed star, as in Monotype Sorts; nor to a hand holding up two fingers, as in Wingdings. (But it can be mapped to a notdef glyph, if the font makes no claim to supporting U+0041.)

U+0915 absolutely can have snow on it, or be bold or italic or whatever (or all of these), as long as a Devanagari reader would recognize its essential ka-ness. It cannot look like a Latin A, nor for that matter can U+0041 look like a Devanagari ka.

Font guys, do you agree with this?

Of course, the term Unicode font is also often used to mean a font that covers all, or nearly all, of Unicode. Font technologies generally don't even allow this, of course, and even by the standards of nearly we are still limiting ourselves to things like Bitstream Cyberbit, Arial Unicode MS, Code2000, Cardo, etc. Right or wrong, this is a commonly accepted meaning for Unicode font.

-Doug Ewell
Fullerton, California
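The remark that font technologies generally do not allow a single font to cover all of Unicode can be made concrete with a rough count. The sketch below is an illustration under stated assumptions, not a measurement of any particular font: it assumes the TrueType/OpenType limit of at most 65,535 glyphs per font (the glyph count is a 16-bit field) and the optimistic case of one glyph per character. The names `MAX_GLYPHS` and `count_assigned` are hypothetical, and the count reflects whatever Unicode version the running Python ships, not Unicode as of 2002.

```python
import sys
import unicodedata

# Assumption: a TrueType/OpenType font stores its glyph count in a 16-bit field.
MAX_GLYPHS = 0xFFFF


def count_assigned() -> int:
    """Count code points that are assigned characters.

    Unassigned (Cn), surrogate (Cs) and private-use (Co) code points
    are excluded from the count.
    """
    excluded = {"Cn", "Cs", "Co"}
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) not in excluded
    )


assigned = count_assigned()
print("Unicode version:", unicodedata.unidata_version)
print("Assigned characters:", assigned)
print("Fits in one font at one glyph per character:", assigned <= MAX_GLYPHS)
```

Even at one glyph per character, which understates what complex scripts need, the repertoire does not fit, so "covers all of Unicode" can only ever be approximate for a single font file.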
Re: Character identities
I'm pretty much in agreement with what you say, except the following:

> Of course, the term Unicode font is also often used to mean a font that covers all, or nearly all, of Unicode.

I would consider a Unicode font to be one that met your other conditions, aside from the repertoire. If I had a font that covered Latin, Greek and Cyrillic and worked with Unicode strings, for example, I would still consider that a Unicode font. I just wouldn't consider it a (pick your adjective) full / complete Unicode font.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

- Original Message -
From: Doug Ewell [EMAIL PROTECTED]
To: Unicode Mailing List [EMAIL PROTECTED]
Sent: Monday, October 28, 2002 17:37
Subject: Re: Character identities
Re: Character identities
All this talk about the letter A reminded me of something from Hofstadter:

  The problem of intelligence, as I see it, is to understand the fluid nature of mental categories, to understand the invariant cores of percepts such as your mother’s face, to understand the strangely flexible yet strong boundaries of concepts such as “chair” or the letter “a” … The central problem of (artificial intelligence) is the question: What is the letter ‘a’ and ‘i’? ... By making these claims, I am suggesting that, for any program to handle letterforms with the flexibility that human beings do, it would have to possess full-scale general intelligence.

  -- Douglas R. Hofstadter, from one of his Metamagical Themas articles

The notion that we could ever capture the essence of A-ness has already been discussed at length and dismissed as impossible without an AI breakthrough. :-)

MichKa
Re: Character identities
Doug Ewell scripsit:

> 1. It must be based on Unicode code points. For True- and OpenType fonts, this implies a Unicode cmap; for other font technologies it implies some more-or-less equivalent mechanism. The point is that glyphs must be associated with Unicode code points (not necessarily 1-to-1, of course), not merely with an internal 8-bit table that can be mapped to Unicode only through some other piece of software.

If it's a FIGlet font, of course, it's automatically Unicode, since FIGlet's table is 32 bits wide.

> In a Unicode font, U+0041 cannot be mapped to a capital A with macron, as it is in Bookshelf Symbol 1; nor to a six-pointed star, as in Monotype Sorts; nor to a hand holding up two fingers, as in Wingdings. (But it can be mapped to a notdef glyph, if the font makes no claim to supporting U+0041.)

In fact, these fonts map these glyphs to U+F041. Only when seen as 8-bit fonts do they map to 0x41.

--
John Cowan
http://www.reutershealth.com
http://www.ccil.org/~cowan
[EMAIL PROTECTED]

With techies, I've generally found
If your arguments lose the first round
Make it rhyme, make it scan
Then you generally can
Make the same stupid point seem profound!
  --Jonathan Robie
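The U+F041 versus 0x41 point can be sketched in a few lines. This is a hedged illustration only: the fixed offset `SYMBOL_OFFSET = 0xF000` is an assumption generalized from the single example above (0x41 corresponding to U+F041), and the function names are hypothetical. What the sketch does show with certainty is that U+F041 has general category Co, that is, it lies in the Private Use Area rather than being LATIN CAPITAL LETTER A.

```python
import unicodedata

# Assumption: the symbol font's 8-bit codes sit at a fixed offset in the
# Private Use Area, as suggested by the 0x41 -> U+F041 example.
SYMBOL_OFFSET = 0xF000


def symbol_code_point(byte_value: int) -> int:
    """Map an 8-bit symbol-font code to its private-use code point."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("expected an 8-bit value")
    return SYMBOL_OFFSET + byte_value


def is_private_use(code_point: int) -> bool:
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(code_point)) == "Co"


cp = symbol_code_point(0x41)
print(f"U+{cp:04X}")          # the code point the glyph is really mapped to
print(is_private_use(cp))     # private use, so no claim about 'A' is made
print(is_private_use(0x41))   # U+0041 itself is an ordinary letter
```

Seen this way, such a font makes no statement about U+0041 at all; the apparent mapping to "A" only arises when an application treats the font as an 8-bit one.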
Re: Character identities
At 18:37 10/28/2002, Doug Ewell wrote:

It seems to me, as a non-font guy, that calling a font a Unicode font implies two things: 1. It must be based on Unicode code points. For True- and OpenType fonts, this implies a Unicode cmap; for other font technologies it implies some more-or-less equivalent mechanism. The point is that glyphs must be associated with Unicode code points (not necessarily 1-to-1, of course), not merely with an internal 8-bit table that can be mapped to Unicode only through some other piece of software.

My only amendment to that would be: 'The point is that those glyphs that are intended to represent the default form of the characters supported by that font must be associated with Unicode codepoints, whether directly or indirectly, not merely...' Not every glyph in a font needs to be encoded, and in general glyph variants and things like ligatures should not be, unless standard Unicode codepoints happen to be available for them (even then, it would be legitimate to leave them unencoded and access them only via glyph processing features).

2. The glyphs must reflect the essential characteristics of the Unicode character to which they are mapped. That means a capital A can be bold, italic, script, sans-serif, etc. A small a can also be small-caps (or even full-size caps), but I think this is the only controversial point.

Yes, I would agree with that, with the caveat that the A-ness of an A isn't necessarily something that can be defined: it can only be recognised.

Of course, the term Unicode font is also often used to mean a font that covers all, or nearly all, of Unicode. Font technologies generally don't even allow this, of course, and even by the standards of nearly we are still limiting ourselves to things like Bitstream Cyberbit, Arial Unicode MS, Code2000, Cardo, etc. Right or wrong, this is a commonly accepted meaning for Unicode font.

I really think we should all do what we can to bury this use of the term. It is singularly unhelpful, and the idea in the minds of some customers that they *need* a font that covers all of Unicode has not done anyone any good. Sure some font developers made some money making these ridiculously huge grab-bag fonts, but their time could have been much better spent.

John Hudson
Tiro Typeworks www.tiro.com
Vancouver, BC [EMAIL PROTECTED]

It is necessary that by all means and cunning, the cursed owners of books should be persuaded to make them available to us, either by argument or by force. - Michael Apostolis, 1467
The comet circumflex system.
Readers interested in internationalization using Unicode might like to know that I have recently added some documents about the comet circumflex system to the web. The introduction and index page are as follows. http://www.users.globalnet.co.uk/~ngo/c_c0.htm The main index page of the webspace is as follows. http://www.users.globalnet.co.uk/~ngo William Overington 29 October 2002
Re: Character identities
John Hudson commented. At 02:46 10/26/2002, William Overington wrote: I don't know whether you might be interested in the use of a small letter a with an e as an accent codified within the Private Use Area, but in case you might be interested, the web page is as follows. http://www.users.globalnet.co.uk/~ngo/ligatur5.htm I have encoded the a with an e as an accent as U+E7B4 so that both variants may coexist in a document encoded in a plain text format and displayed with an ordinary TrueType font. If anyone were interested, he could do this himself and use any codepoint in the Private Use Area. The meaning which I intended to convey was as follows. I don't know whether you might be interested in having a look at a particular example of the use of a small letter a with an e as an accent codified within the Private Use Area by an individual with an interest in applying Unicode, but in case you might be interested in having a look at that particular example, the web page is as follows. If, following from your response to the way that you read my sentence, someone were interested in defining a codepoint in the Private Use Area then certainly he or she could do that himself or herself and use any codepoint in the Private Use Area. However, exercising that freedom is something which could benefit from some thought. If someone wishes to encode an a with an e as an accent in the Private Use Area, he or she may wish to be able to apply that code point allocation in a document. If he or she looks at which Private Use Area codepoints are already in use within some existing fonts, then selecting a code point which is at present unused in those fonts might give a greater chance of his or her new character assignment being implemented than choosing a code point for which those fonts already have a glyph in use. Searching through such fonts takes time and requires some skill. 
If someone does wish to use a Private Use Area code point for an a with an e accent, then using U+E7B4 does give a possible slight advantage in that the code point is already part of a published set of code points available on the web, for, even though that set of code points is not a standard, it is a consistent set and other people might well use those codepoints as well. However, anyone may produce and publish such a set of code point allocations of his or her own if he or she so wishes, or indeed keep them to himself or herself. Yet I was not seeking to make any such point in my posting. I simply added to a thread on a specialised topic what I thought might be a short interesting note with a link to a web page at which some readers might like to look. The web page indeed provides two external links to interesting documents on the web.

Maybe it is time to include a note in the Unicode Standard to suggest that 'Private' Use Area means that one should keep it to oneself

Well, at the moment the Unicode Standard does include the word publish in the text about the Private Use Area. I have published details of various uses of the Private Use Area on the web yet not mentioned them in this forum. For example, readers might perhaps like to have a look at the following. http://www.users.globalnet.co.uk/~ngo/ast07101.htm Anyone who chooses to do so might like to have a look at the following file as well, which introduces the application area. http://www.users.globalnet.co.uk/~ngo/ast02100.htm

This is an application of the Unicode Private Use Area so as to produce a set of soft buttons for a Java calculator so that the twenty hard button minimum configuration of a hand held infra-red control device for a DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) television can be used in a consistent manner to signal information from the end user to the computer in the television set. I am very pleased with the result.
The encoding achieves a useful effect while being consistent for information handling purposes with the Unicode specification, so that an input stream of characters may be processed by a Java program without any ambiguity over whether a particular code point is a printing character or a calculator button (or indeed mouse event or simulated mouse event as mouse events are also encoded using the Private Use Area in my research). William Overington 29 October 2002
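The distinction drawn above between printing characters and Private Use Area codes can be sketched in a few lines. Java would be the natural language for the application described; Python is used here only for brevity. The sketch relies on the standard unicodedata module, which reports general category Co for private-use code points. The function names and the sample input are hypothetical, and the meaning attached to U+E7B4 or any other private-use code point comes from a private agreement that the code cannot know.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True if the single character ch has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def split_stream(text: str) -> tuple[str, list[str]]:
    """Separate ordinary characters from private-use ones, keeping their order."""
    ordinary: list[str] = []
    private: list[str] = []
    for ch in text:
        (private if is_private_use(ch) else ordinary).append(ch)
    return "".join(ordinary), private


if __name__ == "__main__":
    # Hypothetical input: two digits and an operator, with one private-use
    # code point (U+E7B4) mixed in.
    ordinary, private = split_stream("1+\ue7b42")
    print(ordinary)
    print([f"U+{ord(ch):04X}" for ch in private])
```

A receiving program can make this split without ambiguity because the private-use ranges are fixed by the standard, even though the interpretation of each code point within them is not.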
Re: Character identities
At 04:39 PM 10/28/2002 -0600, David Starner wrote: But think of the utility if Unicode added a COMBINING SNOWCAP and COMBINING FIRECAP! But should we combine the SNOWCAP with the ICECAP? (-: Unicode captures the ice-age during the global warming era! Do we have codepoints for images found on the walls of caves? :) Barry www.i18n.com