Re: Phaistos in ConScript
Michael Everson wrote:

> Say that we found another Phaistos document with the same string in
> it, and were able to decipher Phaistos, and found that the string
> matched in meaning and syntax to what's on the disk. Then we would
> have a superfluous character encoded.

You mean like U+0340 and U+0341? (ducking and running)
RE: Phaistos in ConScript
At 15:51 -0700 2002-07-08, Asmus Freytag wrote:
>At 02:43 PM 7/8/02 +0100, Michael Everson wrote:
>>Godart says "The last sign of set A:VIII was not deleted but broke
>>off with a sliver of clay. Bearing in mind the space and outline of
>>the gap, which seems to roughly follow the outline of the broken
>>sign, it seems that the most plausible identification of the
>>mysterious sign is a 3 [TATTOOED HEAD] or a 20 [DOLIUM], unless it
>>is an 8 [GAUNTLET] or a 4 [CAPTIVE], which is less likely." I don't
>>want to encode a new character without better evidence (and
>>wouldn't for ANY script). I haven't seen anything from other
>>scholars who consider it a 46th sign.
>
>This is an insufficient reason for not coding a symbol for an
>unidentified character, since it is unidentified. U+FFFD could be
>pressed into service, but would be awkward if definite agreement on
>identification is reached later, as it can be used for any
>unidentified character, not just Phaistos.

Sorry, this symbol is usually represented by a hatched pattern showing that something is missing. Godart uses [.] in his transcription. Since it is possible that sign 3, 20, 8, or 4 was actually there before the identifying clay broke off, it would be inappropriate to invent something new to represent the missing character.

Say that we found another Phaistos document with the same string in it, and were able to decipher Phaistos, and found that the string matched in meaning and syntax to what's on the disk. Then we would have a superfluous character encoded.

--
Michael Everson *** Everson Typography *** http://www.evertype.com
RE: Phaistos in ConScript
Ken. Thanks for your response.

> > Now let us say I wish to represent this text LTR, as I do. Well, if I
> > reverse the presentation order I get PLUMED-HEAD SHIELD CLUB
> > PEDESTRIAN BOOMERANG -- but if I don't reverse the glyphs, then
> > plumed-head is still facing to the right, as is the boomerang -- how
> > am I to know that the directionality is LTR?
>
>Because then it will say:
>
>GNAREMOOB NAIRTSEDEP BULC DLEIHS DAEH-DEMULP

As I said, the original might (assuming a syllabic structure and assigning random syllable values) well be LABUGIDANO, but when reversed it might read NODAGIBULA, which could be a valid linguistic sequence. OK, so reading the whole text you would come up with readings which wouldn't make sense, and you would have to start over with a different directionality. Given the practice of the other scripts in the region, I consider this unlikely, not least because it is so impractical. The people who used scripts with multiple directionalities did reverse the glyphs when reversing the directionality. The inherent directionality of Phoenician BETH or of PLUMED-HEAD or of Egyptian WN (the bunny rabbit) lends itself to the use of such glyph-indicated directionality for text in general.

I would not assume, additionally, that the Phaistos script would always be written on disks in spiral formatting. That too would be unlikely and impractical, would it not?

> > I can't. I will start reading with the boomerang.
>
>What's the matter -- can't you read and write Phaistos correctly?

Hmpf.

> > That Godart did not make this correction in his book when he used LTR
> > directionality was an error. I'm sticking by the decision I made when
> > I made my fonts, because it is more likely to be right than not.
>
>I think you may be sticking your neck out rather far (to the left)
>on this one. I am inclined to agree with Marco about the issue for
>presentation. Why should you innovate over Godart here in this
>*particular* instance, based on so little evidence.

Because I suspect that Godart might well agree with me -- I don't imagine that he ever considered this aspect of text presentation. And because it makes sense given the context of other scripts in the region.

>You could be right, but then you could be wrong, too.

So could Godart! He was describing the disk, not thinking about encoding and presenting it!

> > There aren't any other scripts in the area which change
> > directionality without reversing the glyphs, and Phaistos
> > certainly isn't Chinese.
>
>Well that much I agree 100% with.

My point being that though Beijing and Hong Kong newspaper headlines might present LTR or RTL directionality without mirroring, this practice is rare or indeed unknown in Europe at 1700 BCE. Well, that's my opinion anyway.

I suppose we could try to contact Godart and ask his opinion. It's not as though the CSUR is normative.

--
Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Whats the difference between a composite and a combining sequence?
That is also consistent with the glossary definitions: http://www.unicode.org/glossary.

tex

Kenneth Whistler wrote:
>
> Theodore,
>
> > http://www.unicode.org/unicode/reports/tr15/ mentions both
> > composites and combining sequences.
> >
> > But it doesn't tell us the difference. I know what a combining
> > sequence is. If I didn't know what a composite was, I'd guess it
> > was the same thing as a combining sequence.
>
> See TUS 3.0, Chapter 3, pp. 43-44
>
> D17 Combining character sequence: a character sequence consisting of
>     either a base character followed by a sequence of one or more
>     combining characters, or a sequence of one or more combining
>     characters.
>
>     [e.g. A + combining-grave ]
>
> D18 Decomposable character: a character that is equivalent to a sequence
>     of one or more other characters, according to the decomposition
>     mappings found in the names list... It may also be known as a
>     precomposed character or composite character.
>
>     [e.g. A-grave, U+00C0]
>
> --Ken

--
Tex Texin    cell: +1 781 789 1898    mailto:[EMAIL PROTECTED]
Xen Master   http://www.i18nGuy.com
XenCraft     http://www.XenCraft.com
Making e-Business Work Around the World
Re: Whats the difference between a composite and a combining sequence?
Theodore,

> http://www.unicode.org/unicode/reports/tr15/ mentions both
> composites and combining sequences.
>
> But it doesn't tell us the difference. I know what a combining
> sequence is. If I didn't know what a composite was, I'd guess it
> was the same thing as a combining sequence.

See TUS 3.0, Chapter 3, pp. 43-44

D17 Combining character sequence: a character sequence consisting of
    either a base character followed by a sequence of one or more
    combining characters, or a sequence of one or more combining
    characters.

    [e.g. A + combining-grave ]

D18 Decomposable character: a character that is equivalent to a sequence
    of one or more other characters, according to the decomposition
    mappings found in the names list... It may also be known as a
    precomposed character or composite character.

    [e.g. A-grave, U+00C0]

--Ken
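The two definitions can be checked against the character database that ships with Python. This is only a sketch: the helper names `describe` and `decomposition_of` are made up for illustration, and the results reflect whatever Unicode version the local Python carries, not TUS 3.0 specifically.

```python
import unicodedata

# D17: a combining character sequence -- base letter A followed by
# COMBINING GRAVE ACCENT (two code points).
combining_sequence = "\u0041\u0300"

# D18: a decomposable (precomposed, "composite") character --
# LATIN CAPITAL LETTER A WITH GRAVE (one code point).
composite = "\u00C0"


def describe(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


def decomposition_of(ch):
    """Return the decomposition mapping of ch as a list of code points.

    Compatibility tags such as <wide> are skipped; an empty list means
    the character has no decomposition mapping.
    """
    fields = unicodedata.decomposition(ch).split()
    return [int(f, 16) for f in fields if not f.startswith("<")]


print(describe(combining_sequence))
print(describe(composite))
print([f"U+{cp:04X}" for cp in decomposition_of(composite)])
```

The composite is a single code point whose decomposition mapping is exactly the combining sequence, which is why the two are distinct as code point strings but canonically equivalent.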
Whats the difference between a composite and a combining sequence?
http://www.unicode.org/unicode/reports/tr15/ mentions both composites and combining sequences.

But it doesn't tell us the difference. I know what a combining sequence is. If I didn't know what a composite was, I'd guess it was the same thing as a combining sequence.

However, the two are meant to be different, so they can't be the same thing. If I am getting the Unicode terminology correct, a combining sequence is something like a plain ASCII letter A with the accent following it as a separate character.
Acrobat question
A bit off topic, but if you have a PDF file without page numbers on the original pages, is it possible to add them so that when it prints the page numbers appear? -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Multiple encodings for 1 character
>> For example, for filenames, OSX will encode an accented Roman
>> letter one way, while for filenames Windows will encode it the
>> other way. These kinds of confusion are totally expected, if
>> Unicode will allow more than one way to encode the same
>> character.
>
> Perhaps a stray newsfeed routed via Alpha Centauri?
> This is *very* old news, indeed.

I'm new to this, though.

>> This means that matching algorithms won't work, because the
>> characters are different!
>>
>> Will there be some kind of recommendation of which to avoid?
>> Will the Unicode consortium make a standard to say that one of
>> these encodings is strongly not recommended, and in fact
>> depreciated?
>
> UAX #15: Unicode Normalization Forms
>
> http://www.unicode.org/unicode/reports/tr15/

Thanks.

> And it is up to an implementation to specify which normalization
> form it uses.
>
> By the way, we don't depreciate Unicode encodings -- we appreciate
> them. ;-)

That's a shame. Simplicity is wonderful.

--
Theodore H. Smith - Macintosh Consultant / Contractor.
My website:
RE: Phaistos in ConScript
At 02:43 PM 7/8/02 +0100, Michael Everson wrote:
>Godart says "The last sign of set A:VIII was not deleted but broke off
>with a sliver of clay. Bearing in mind the space and outline of the gap,
>which seems to roughly follow the outline of the broken sign, it seems
>that the most plausible identification of the mysterious sign is a 3
>[TATTOOED HEAD] or a 20 [DOLIUM], unless it is an 8 [GAUNTLET] or a 4
>[CAPTIVE], which is less likely." I don't want to encode a new character
>without better evidence (and wouldn't for ANY script). I haven't seen
>anything from other scholars who consider it a 46th sign.

This is an insufficient reason for not coding a symbol for an unidentified character, since it is unidentified. U+FFFD could be pressed into service, but would be awkward if definite agreement on identification is reached later, as it can be used for any unidentified character, not just Phaistos.
Re: Multiple encodings for 1 character
Theodore wrote:

> What is going to be done about the confusion generated from
> having multiple ways to encode the same character?
>
> For example, for filenames, OSX will encode an accented Roman
> letter one way, while for filenames Windows will encode it the
> other way. These kinds of confusion are totally expected, if
> Unicode will allow more than one way to encode the same
> character.

Perhaps a stray newsfeed routed via Alpha Centauri?
This is *very* old news, indeed.

> This means that matching algorithms won't work, because the
> characters are different!
>
> Will there be some kind of recommendation of which to avoid?
> Will the Unicode consortium make a standard to say that one of
> these encodings is strongly not recommended, and in fact
> depreciated?

UAX #15: Unicode Normalization Forms

http://www.unicode.org/unicode/reports/tr15/

And it is up to an implementation to specify which normalization form it uses.

By the way, we don't depreciate Unicode encodings -- we appreciate them. ;-)

> And what about the OS that uses this encoding? How will the
> Unicode consortium make the newly-offending OS change its ways?

It isn't offending, and the Unicode Consortium won't.

--Ken
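For the matching problem raised here, the usual approach is to bring both strings to the same normalization form before comparing. A minimal sketch, assuming Python's standard `unicodedata` module; the filenames and the helper name `same_text` are hypothetical and do not claim to show what any particular operating system actually stores.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form.

    Which form to use (NFC, NFD, ...) is a choice the implementation
    makes; what matters is applying the same one to both sides.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Hypothetical filenames: the same visible name spelled two ways.
precomposed = "caf\u00E9.txt"    # e-acute as a single code point
decomposed = "cafe\u0301.txt"    # e followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                  # raw comparison
print(same_text(precomposed, decomposed))         # compared under NFC
print(same_text(precomposed, decomposed, "NFD"))  # compared under NFD
```

The raw comparison fails because the code point sequences differ; after normalization the two names compare equal whichever of the two forms is chosen.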
Re: Multiple encodings for 1 character
You will have to normalize the way the strings are processed, and you need to make sure it is done the same way every time. Check out ICU for this purpose.

http://oss.software.ibm.com/icu/

Dave

--- "Theodore H. Smith" <[EMAIL PROTECTED]> wrote:
> What is going to be done about the confusion generated from
> having multiple ways to encode the same character?
>
> For example, for filenames, OSX will encode an accented Roman
> letter one way, while for filenames Windows will encode it the
> other way. These kinds of confusion are totally expected, if
> Unicode will allow more than one way to encode the same
> character.
>
> This means that matching algorithms won't work, because the
> characters are different!
>
> Will there be some kind of recommendation of which to avoid?
> Will the Unicode consortium make a standard to say that one of
> these encodings is strongly not recommended, and in fact
> depreciated?
>
> And what about the OS that uses this encoding? How will the
> Unicode consortium make the newly-offending OS change its ways?
>
> And what about the hordes of apps that expect one format but
> don't expect the other? And the hordes of OS-independent apps
> (Java? Perl?) that might generate conflicting versions?

=====
Dave Possin
Globalization Consultant
www.Welocalize.com
http://groups.yahoo.com/group/locales/
RE: Phaistos in ConScript
You guys are not thinking things through.

Firstly, the fact that the only document we have was made with stamps rather than drawn by hand means nothing. Chinese can be written with a brush, a pen, a chisel, or it can be impressed into wax with a seal.

You have to look at the structure of the script and think of legibility. Most of the glyphs are strongly directional. Let us assume that we have a string of text PLUMED-HEAD SHIELD CLUB PEDESTRIAN BOOMERANG (that's as encoded in the backing store). The script shows RTL directionality, and when reading it we read into the face of the PLUMED-HEAD. SHIELD and CLUB are symmetrical, but PEDESTRIAN and BOOMERANG are not. The characters display as BOOMERANG PEDESTRIAN CLUB SHIELD PLUMED-HEAD, where plumed-head faces right and the boomerang points right as well. We read RTL.

Now let us say I wish to represent this text LTR, as I do. Well, if I reverse the presentation order I get PLUMED-HEAD SHIELD CLUB PEDESTRIAN BOOMERANG -- but if I don't reverse the glyphs, then plumed-head is still facing to the right, as is the boomerang -- how am I to know that the directionality is LTR? I can't. I will start reading with the boomerang.

Let's pretend we knew the syllabic values of these characters. PLUMED-HEAD is LA, SHIELD is BU, CLUB is GI, PEDESTRIAN is DA, BOOMERANG is NO. The correct reading must be LABUGIDANO, but if you reverse RTL directionality to LTR directionality without reversing the glyphs, you won't know that the directionality has changed, and you will be tempted to read NODAGIBULA. And what if that was a valid sequence in your language?

That Godart did not make this correction in his book when he used LTR directionality was an error. I'm sticking by the decision I made when I made my fonts, because it is more likely to be right than not. There aren't any other scripts in the area which change directionality without reversing the glyphs, and Phaistos certainly isn't Chinese.
-- Michael Everson *** Everson Typography *** http://www.evertype.com
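The LABUGIDANO / NODAGIBULA argument can be walked through mechanically. What follows is only a toy model: the syllable values are the invented ones from the message, the `Glyph` class and all function names are hypothetical, and PEDESTRIAN is assumed to face right in the RTL display like the other directional signs. The reader is modelled by the single rule "read into the faces".

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Glyph:
    name: str
    syllable: str   # invented value, for illustration only
    facing: str     # "right", "left", or "none" for symmetrical signs


FLIP = {"right": "left", "left": "right", "none": "none"}

# Backing-store (logical) order, with facings as shown in the RTL display.
LOGICAL = [
    Glyph("PLUMED-HEAD", "LA", "right"),
    Glyph("SHIELD", "BU", "none"),
    Glyph("CLUB", "GI", "none"),
    Glyph("PEDESTRIAN", "DA", "right"),   # facing is an assumption
    Glyph("BOOMERANG", "NO", "right"),
]


def mirror(glyph):
    """Return the glyph flipped horizontally."""
    return replace(glyph, facing=FLIP[glyph.facing])


def display_rtl(glyphs):
    """Visual line, listed left to right, for right-to-left presentation."""
    return list(reversed(glyphs))


def display_ltr(glyphs, mirrored):
    """Visual line for left-to-right presentation, optionally mirrored."""
    return [mirror(g) if mirrored else g for g in glyphs]


def read(visual):
    """Read a visual line by reading into the faces of directional glyphs."""
    facings = {g.facing for g in visual if g.facing != "none"}
    if facings == {"right"}:
        ordered = reversed(visual)      # faces point right: start at the right
    elif facings == {"left"}:
        ordered = visual                # faces point left: start at the left
    else:
        raise ValueError("no consistent facing to infer direction from")
    return "".join(g.syllable for g in ordered)


print(read(display_rtl(LOGICAL)))                  # LABUGIDANO
print(read(display_ltr(LOGICAL, mirrored=True)))   # LABUGIDANO
print(read(display_ltr(LOGICAL, mirrored=False)))  # NODAGIBULA
```

Under this model, only the unmirrored LTR line is misread. The model says nothing about whether the makers of the disc followed that reading rule, which is the point actually in dispute.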
Re: Multiple encodings for 1 character
Theodore: Search the Unicode site for "normalization". -- Michael Everson *** Everson Typography *** http://www.evertype.com
Multiple encodings for 1 character
What is going to be done about the confusion generated from having multiple ways to encode the same character?

For example, for filenames, OSX will encode an accented Roman letter one way, while for filenames Windows will encode it the other way. These kinds of confusion are totally expected, if Unicode will allow more than one way to encode the same character.

This means that matching algorithms won't work, because the characters are different!

Will there be some kind of recommendation of which to avoid? Will the Unicode consortium make a standard to say that one of these encodings is strongly not recommended, and in fact depreciated?

And what about the OS that uses this encoding? How will the Unicode consortium make the newly-offending OS change its ways?

And what about the hordes of apps that expect one format but don't expect the other? And the hordes of OS-independent apps (Java? Perl?) that might generate conflicting versions?
RE: Phaistos in ConScript
Marco recently said:

> > >5. I find that mirroring the signs as you did in your font is
> > >unhistorical. The whole corpus is right-to-left, and the fact that
> > >the signs were impressed with types makes it impossible that the
> > >signs could have been reversed. In academic books, it is common
> > >practice to type the disc's text left-to-right, but the signs are
> > >not reversed.
> >
> > [Michael]
> > I have followed Egyptological -- and ancient Egyptian -- practice
> > here. If the script is represented right-to-left the faces point to
> > the right so that you read into their faces. If the script direction
> > is reversed so that it is left-to-right, it is conventional -- among
> > Egyptologists and ancient Egyptians -- to reverse the signs as well.
>
> I see. But Hieroglyphs were handwritten, not "typed". Moreover, the
> mirroring of glyphs is actually attested for Egyptian.
>
> > Godart does not reverse the glyphs even though he reverses the
> > directionality, but I think it is *his* practice which is
> > ahistorical, and I think it makes the text harder to read. And I
> > suspect it has to do with the font technology he had in 1994 when he
> > wrote his book.
>
> It seems that July 2002 is our disagreement month... I think that Godart
> was perfectly right avoiding assumptions that he could not support: there is
> no reason to think that the Phaistos "script" should work as Egyptian
> hieroglyphs work.

I would support you in this. Michael says that all the scripts in the region go both ways, but we don't even know that the disk is from the region. (And the headdresses apparently don't look local.) It might have come some way in trade.

I feel tempted to protest that the characters aren't in the right order, but someone might take me up on that :-) I'm probably right though!

[The reason I haven't replied directly to Michael's message is that something about his messages crashes my mail reader when I try it. Apologies to everyone for accidentally including a load of message headers last time I tried a workaround.]

Tim

--
Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Saying characters out loud (derives from hash, pound, octothorpe?)
William Overington recently said: > Still no olde worlde shoppe name with a yogh in though yet? :-) Why bother with an old one when there is a current shop with a yogh? Do you have a newsagent called Menzies in your part of England? (They have spread from Scotland.) That isn't a zed (or zee) in the name; it's a yogh. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
RE: Phaistos in ConScript
Michael Everson wrote: > How much more imprudent is it to encode it as a unique character when > nothing is known about it? :-) :-) > >E.g. would you dare to unify it with U+0316 (COMBINING GRAVE > ACCENT BELOW) > >without knowing whether it is a stress mark, a tone mark, a > cantillation > >mark, a vowel muter, a full stop, a comma, a determinative for > >logographs...? > > I ask again: > > > > Do you have an analysis of all the signs which take it > in the document? > > > >Yes, in Louis Godart, "Il disco di Festo: l'enigma di una scrittura", > >Einaudi (Italy) 1994, ISBN 8806128922. An English > translation should now be > >available. > > OK, I have the English translation of it. But you want the character. > You do the work. Please look and tell me by cell number and character > (A-I-22, A-IV-1, B-VI-45) where they are actually applied. Be > comprehensive. Thanks. It will be a delightful activity for my vacations. (But I know what my wife will say: "Aren't you bringing *that* book with you again also this vacations, are you?") > > > I agree that those names aren't good. The dotted one > occurs at the > >> beginning of the text on both sides. PHAISTOS BEGINNING > OF TEXT and > >> PHAISTOS SEPARATOR then? > > > >Still assumptions, but much more reasonable. > > The one does begin the text on both sides, and the other does > separate. I was just implying that nothing more than "reasonable" can be said about character names for an unknown script. Nobody can honestly say they are "correct" or "incorrect". Imagine that these last two paragraphs were the only remains of English, it would be perfectly reasonable to chose the name ENGLISH BEGINNING OF TEXT for uppercase "I"... > > > I have followed Egyptological -- and ancient Egyptian -- practice > >> here. If the script is represented right-to-left the > faces point to > >> the right so that you read into their faces. 
If the > script direction > >> is reversed so that it is left-to-right, it is > conventional -- among > >> Egyptologists and ancient Egyptians -- to reverse the > signs as well. > > > >I see. But Hieroglyphs were handwritten, not "typed". > > And carved in stone and wood. Impressed in soft clay > probably. Probably? Never heard such a thing, apart seals. BTW, Egyptian would have required a big set of punches, and it would have posed complex kerning issues. > Your point? Handwriting (or hand carving) a mirrored version of a sign has no additional costs. Impressing a mirrored version of a sign means casting two (golden?) sets of punches. However, if you faithfully copy the glyphs seen on the disc, you cannot be wrong. If you don't, you can be right or wrong, depending on chance. > >Moreover, the mirroring of glyphs is actually attested for Egyptian. > > Yeah because you have thousands of documents. Mirroring is also > attested in Greek and Etruscan. I don't think I've erred in thinking > that it would apply to Phaistos in left-to-right directionality. The signs of Egyptian, Greek and Etruscan were all handwritten; those of Ph.D. weren't. Anyway, we know that Egyptian, Greek and Etruscan allowed mirroring; for Ph.D. we simply don't know. > > > Godart does not reverse the glyphs even though he reverses the > >> directionality, but I think it is *his* practice which is > >> ahistorical, and I think it makes the text harder to read. And I > >> suspect is has to do with the font technology he had in > 1994 when he > >> wrote his book. > > > >It's seems that July 2002 is our disagreement month... I > think that Godart > >was perfectly right avoiding assumptions that he could not > support: there is > >no reason to think that the Phaistos "script" should work as Egyptian > >hieroglyphs work. > > No way! *ALL* of the scripts of that part of the world show mirroring > of characters when the script direction is reversed. 
There's no > reason to assume that Phaistos would be otherwise. There are three very good reasons: 1) See the above about costs and planning ahead. 2) AFAIK, it is not true that *all* other scripts in the Mediterranean had mirroring. Particularly I never heard this for Linear A, Linear B and Cyprian, which are the most likely relatives of Ph.D. 3) Anyway, we don't know for sure which "part of the world" the Phaistos Disc is from. _ Marco
Re: Saying characters out loud (derives from hash, pound,octothorpe?)
I have heard:

squiggle for tilde
bang for exclamation mark
hook for question mark.

tex

Barry Caplan wrote:
>
> At 11:37 AM 7/5/2002 +0100, Michael Everson wrote:
> >>Also, how does one say the U+007E character out loud while reading out the
> >>address of a web page?
> >
> >"Tilde". Get real, William.
>
> FF5E is colloquially known as a "wave" in Japanese, IIRC, and hence 007E is a
> "small wave" or "half width wave".
>
> Barry Caplan
> www.i18n.com

--
Tex Texin    cell: +1 781 789 1898    mailto:[EMAIL PROTECTED]
Xen Master   http://www.i18nGuy.com
XenCraft     http://www.XenCraft.com
Making e-Business Work Around the World
Re: Saying characters out loud (derives from hash, pound, octothorpe?)
At 11:37 AM 7/5/2002 +0100, Michael Everson wrote: >>Also, how does one say the U+007E character out loud while reading out the >>address of a web page? > >"Tilde". Get real, William. FF5E is colloquially known as a "wave" in Japanese, IIRC, and hence 007E is a "small wave" or "half width wave". Barry Caplan www.i18n.com
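The relationship between the two code points mentioned here can be looked up in the character database bundled with Python. A small sketch, assuming the standard `unicodedata` module; the helper name `tilde_info` is made up, and the lookup says nothing about what either character is called colloquially.

```python
import unicodedata


def tilde_info(ch):
    """Return (code point label, character name, East Asian width class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
    )


for ch in ("\u007E", "\uFF5E"):
    print(tilde_info(ch))

# Compatibility normalization folds the fullwidth form onto the ASCII one.
print(unicodedata.normalize("NFKC", "\uFF5E") == "\u007E")
```

U+FF5E is the fullwidth counterpart of U+007E, which fits the "half width" description given for the ASCII character.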
RE: Phaistos in ConScript
At 17:40 +0200 2002-07-08, Marco Cimarosti wrote: >Michael Everson wrote: >> >1. Your lacks an important sign, which I would call "PHAISTOS >> >COMBINING LINE BELOW". [...] >> >> Um, can't something from General Punctuation be used, in the absence >> of knowing more about this "character"? > >It seems very imprudent, considering that nothing is known abut the nature >of that a sign. How much more imprudent is it to encode it as a unique character when nothing is known about it? :-) >E.g. would you dare to unify it with U+0316 (COMBINING GRAVE ACCENT BELOW) >without knowing whether it is a stress mark, a tone mark, a cantillation >mark, a vowel muter, a full stop, a comma, a determinative for >logographs...? I ask again: > > Do you have an analysis of all the signs which take it in the document? > >Yes, in Louis Godart, "Il disco di Festo: l'enigma di una scrittura", >Einaudi (Italy) 1994, ISBN 8806128922. An English translation should now be >available. OK, I have the English translation of it. But you want the character. You do the work. Please look and tell me by cell number and character (A-I-22, A-IV-1, B-VI-45) where they are actually applied. Be comprehensive. Thanks. >BTW, the only thing I disliked in this excellent book was the fact that, >IMHO, Godart was to quick to accept the assumption that this sign could be >punctuation, and he even uses it to segment the text in "sentences" or >"veses". What page or section does he state that specifically? >Perhaps, it would be useful to have a (non PUA) Unicode symbol to mark >unidentified characters in any kind of paleographic or critic texts. This >could be the object of a proposal, or it could be unified with one of the >existing shaded rectangles. Markup. In my file I just wrote [.] as Godart did. But for Egypian and Cuneiform it's been suggested that markup is the appropriate means for showing this element of palaeography. > > I agree that those names aren't good. 
The dotted one occurs at the >> beginning of the text on both sides. PHAISTOS BEGINNING OF TEXT and >> PHAISTOS SEPARATOR then? > >Still assumptions, but much more reasonable. The one does begin the text on both sides, and the other does separate. > > I don't like VERTICAL LINE and DOTTED >> VERTICAL LINE very much. That kind of description we usually reserve >> for abstract technical symbols rather than punctuation. > >Punctuation? Did you discover it is punctuation? :-) Separators are punctuation. What else? Perhaps it is a 17th-century BCE spreadsheet. >OTOH, you know the Phaistos Disk "translators": for many of them, the >character names on your CSUR page make enough evidence that PHAISTOS SIGN OX >BACK was pronounced /bu/. (or even /kau as/ :-) There are silly people everywhere. > > I have followed Egyptological -- and ancient Egyptian -- practice >> here. If the script is represented right-to-left the faces point to >> the right so that you read into their faces. If the script direction >> is reversed so that it is left-to-right, it is conventional -- among >> Egyptologists and ancient Egyptians -- to reverse the signs as well. > >I see. But Hieroglyphs were handwritten, not "typed". And carved in stone and wood. Impressed in soft clay probably. Your point? >Moreover, the mirroring of glyphs is actually attested for Egyptian. Yeah because you have thousands of documents. Mirroring is also attested in Greek and Etruscan. I don't think I've erred in thinking that it would apply to Phaistos in left-to-right directionality. > > Godart does not reverse the glyphs even though he reverses the >> directionality, but I think it is *his* practice which is >> ahistorical, and I think it makes the text harder to read. And I >> suspect is has to do with the font technology he had in 1994 when he >> wrote his book. > >It's seems that July 2002 is our disagreement month... 
I think that Godart >was perfectly right avoiding assumptions that he could not support: there is >no reason to think that the Phaistos "script" should work as Egyptian >hieroglyphs work. No way! *ALL* of the scripts of that part of the world show mirroring of characters when the script direction is reversed. There's no reason to assume that Phaistos would be otherwise. >I don't think font technology had anything to do with this choice: from my >printed edition of "Il disco di Festo" I can see clearly that the text was >reproduced using little images, not a font (sometimes the borders of the >film and the adhesive tape are still visible). Right, so then he had a sheet of drawings photocopied dozens of times and pasted them down. He didn't think of directionality in the way we do I guess. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Chromatic text. (follows from Re: [unicode] Re: FW:Inappropriate Proposals FAQ)
At 15:19 +0100 2002-07-08, William Overington wrote: >Actually I was trying in the posting upon which you comment to suggest that, >even if people do not agree with me about having colour codes in a plain >text file, they might perhaps consider as a separate issue the adding into >regular Unicode of a zero width operator whose use would be to indicate that >a character, such as U+1362, should be decorated chromatically. no no No No NO. Characters are not distinguished by colour, unconfirmed statements about Aztec notwithstanding. >This would mean that a sequence U+1362 ZWJ ZWCDO could be used in >documents, which would give a chromatically decorated glyph with a >chromatic font yet would just give U+1362 as a monochrome character >if the font did not recognize the U+1362 ZWJ ZWCDO sequence. This is NOT ligation, and it is NOT what the ZWJ is for, and it is NOT an appropriate extension of the >My opinion is that splitting text files into just two categories, either >plain text or markup is not sufficient, but that there should perhaps be >more categories or, if there are but two categories that the dividing line >between them should be in a different place. Ten billion documents on the internet and the entire course of modern text processing indicate that it is unwise to hold the opinion that you do. Why don't you simply admit that you have been barking up the wrong tree and initiate more useful work? We have been about as civil as you can expect, though I am sure you have noticed that my own patience with this silliness is about at an end. 
>I tend to base the essential dividing line upon whether the encoding >of the file of code points is meaningful if one tries to compute the >effect of a code point upon the system as simply the effect of that >code point as it stands, without having to have software recognize a >character such as < and determine that a markup bubble is being >entered then to have to read in several more characters within the >markup bubble before taking any action as a result of the first >character in the sequence (that is, the < character) being read. Well get over it. You have seriously misconstrued the difference between "plain text" and "rich text". Both have been in use for many, many years and no one has had much trouble with it. Wondering "whether the encoding of the file of code points is meaningful" is not going to gain you very much in this line of misreasoning. >That distinction means that each Unicode character is processed as >it is received within the main loop of the program, without the >receiving of a < character putting the processing into an inner loop >within a markup bubble, within which bubble ordinary Unicode >character codes which are read have a >different meaning than in the Unicode specification. Processing of characters happens at many different levels. All I can say is that it is clear that you do not know what you are talking about. >To me, such a distinction means that people who are using lower cost, more >generally available software packages, might by such an approach be able in >the not too distant future to use files in a non-proprietary portable format >and get much better results than just using monochrome traditional plain >text. Balderdash. In the first place, those imaginary people are not expressing a user need for your pseudo-solutions. Real people use markup to colour their texts, and have been since the first colour monitors were introduced and MacWrite made it possible. What was that, 15 years ago? 
>Perhaps some sort of consensus over nomenclature for three categories of >text file could occur, namely plain text in the manner which you like it, >plain text in the manner in which I like it and markup. Maybe plain text, >enhanced text and markup would be suitable names. How do people feel about >that please? I feel ill. >It is unfortunately the case in discussions that when someone disagrees with >an idea that is put forward that he or she is more likely to respond in >public than if he or she agrees with an idea which is put forward, or has >simply read about the idea and just notes it as an interesting possibility. >This can have the effect that many people may agree with an idea or at least >not be against it yet make no comment, perhaps giving an impression that an >idea is not well received at large when in fact that is not necessarily the >case. Don't fool yourself. Your "plain text in the manner in which you like it" is a lamentable abuse of character codes to achieve the same results that real markup of various kinds has delivered for decades. I assure you, the ranks of this list are not filled with people agreeing with you. -- Michael Everson *** Everson Typography *** http://www.evertype.com
RE: Chromatic text. (follows from Re: [unicode] Re: FW: Inappropriate Proposals FAQ)
William Overington wrote: > Actually I was trying in the posting upon which you comment > to suggest that, even if people do not agree with me about > having colour codes in a plain text file, they might > perhaps consider as a separate issue the adding into regular > Unicode of a zero width operator whose use would be > to indicate that a character, such as U+1362, should be > decorated chromatically. Come on, William!! Adding such a "zero width operator" *is* having color in plain text! And adding such "zero width operators" *is* inserting mark up in plain text! > >I interpret your post as one more lengthy repetition of your > well-known > >opinion: differences between "plain text" and "rich text" > should not exist: > >they should be eliminated by incorporating the mark-up in > the encoding. > > Actually, that is not my opinion. No, I know. This is my explanation of my perception of your explanation of your opinion. Now I am not sure what your perception of my explanation of my perception of your explanation of your opinion might be. Gentlemen, communication is such a difficult art! > [...] > Perhaps some sort of consensus over nomenclature for three > categories of > text file could occur, namely plain text in the manner which > you like it, > plain text in the manner in which I like it and markup. > Maybe plain text, > enhanced text and markup would be suitable names. How do > people feel about > that please? I would suggest "proletarian text", "middle-class text" and "capitalist text", if I wasn't so scared that someone could take it seriously. > It is unfortunately the case in discussions that when someone > disagrees with > an idea that is put forward that he or she is more likely to > respond in > public than if he or she agrees with an idea which is put > forward, or has > simply read about the idea and just notes it as an > interesting possibility. 
> This can have the effect that many people may agree with an > idea or at least > not be against it yet make no comment, perhaps giving an > impression that an > idea is not well received at large when in fact that is not > necessarily the > case. Yes, definitely a difficult art. _ Marco
RE: Phaistos in ConScript
Michael Everson wrote: > >1. Your lacks an important sign, which I would call "PHAISTOS > >COMBINING LINE BELOW". [...] > > Um, can't something from General Punctuation be used, in the absence > of knowing more about this "character"? It seems very imprudent, considering that nothing is known about the nature of that sign. E.g. would you dare to unify it with U+0316 (COMBINING GRAVE ACCENT BELOW) without knowing whether it is a stress mark, a tone mark, a cantillation mark, a vowel muter, a full stop, a comma, a determinative for logographs...? > Do you have an analysis of > all the signs which take it in the document? Yes, in Louis Godart, "Il disco di Festo: l'enigma di una scrittura", Einaudi (Italy) 1994, ISBN 8806128922. An English translation should now be available. BTW, the only thing I disliked in this excellent book was the fact that, IMHO, Godart was too quick to accept the assumption that this sign could be punctuation, and he even uses it to segment the text into "sentences" or "verses". Apart from this detail, Godart did excellent work in delivering all the known facts and rejecting all fanciful and undemonstrable assumptions. > >2. The last sign of the tenth group ("word"?) is almost totally > >lost, due to a crack. However, it seems than none of the 45 > known signs may > >fit in the gap. Many scholars consider this to be a 46th > sign. The glyph > >normally used is the literature is a texture of diagonal lines. > > Godart says "The last sign of set A:VIII was not deleted but broke > off with a sliver of clay. Bearing mind the space and outline of the > gap, which seems to roughtly follow the outline of the broken sign, > it seems that the most plausible identification of the mysterious > sign is a 3 [TATTOOED HEAD] or a 20 [DOLIUM], unless it is an 8 > [GAUNTLET] or a 4 [CAPTIVE], which is less likely." I don't want to > encode a new character without better evidence (and wouldn't for ANY > script). 
I haven't seen anything from other scholars who consider it > a 46th sign. Godart himself allows for this possibility in the book I mentioned above. But you are right, encoding this "phantom" character would be a problem if the missing character is later identified. Perhaps it would be useful to have a (non-PUA) Unicode symbol to mark unidentified characters in any kind of paleographic or critical text. This could be the object of a proposal, or it could be unified with one of the existing shaded rectangles. > >... about the character names: > > > >3. The names for E6FE and E6FF ("PHAISTOS PARAGRAPH SEPARATOR" and > >"PHAISTOS PHRASE SEPARATOR") show imprudent assumptions. E.g., many > >people consider E6FF to be a paragraph or text separator, and E6FE > >to be a word separator. It would be more prudent to use a more > >generic wording, e.g. "PHAISTOS VERTICAL LINE" and "PHAISTOS > >VERTICAL DOTTED LINE". > > I agree that those names aren't good. The dotted one occurs at the > beginning of the text on both sides. PHAISTOS BEGINNING OF TEXT and > PHAISTOS SEPARATOR then? Still assumptions, but much more reasonable. > I don't like VERTICAL LINE and DOTTED > VERTICAL LINE very much. That kind of description we usually reserve > for abstract technical symbols rather than punctuation. Punctuation? Did you discover it is punctuation? :-) > >4. Names such as "pedestrian", "plumed head", ... "wavy band" are > >just nicknames used by scholars, as opposed to accepted > identifications of > >the objects represented. It may be worth to emphasize this > in the character > >names: e.g., "PHAISTOS SIGN KNOWN AS PEDESTRIAN". > > We either use the numbers given by the scholars, so U+E6D0 can either > be called PHAISTOS SIGN-01 or PHAISTOS SIGN PEDESTRIAN. Either way > we're using a scholarly designation. The meaningful nicknames are > more fun than the numeric ones "PHAISTOS SIGN-01" would be too meaningless. 
I still feel ashamed for my stupid idea that Unicode Kang Xi radicals should have been called "KANG XI RADICAL 1" .. "KANG XI RADICAL 214". OTOH, you know the Phaistos Disk "translators": for many of them, the character names on your CSUR page make enough evidence that PHAISTOS SIGN OX BACK was pronounced /bu/. (or even /kau as/ :-) > >... and about the Everson Phaistos font: > > > >5. I find that mirroring the signs as you did in your font is an > >unhistorical. The whole corpus is right-to-left, and the > fact that the signs > >where impressed with types makes it impossible that the > signs could have > >been reversed. In academic books, it is common practice to > type the disc's > >text left-to-right, but the signs are not reversed. > > I have followed Egyptological -- and ancient Egyptian -- practice > here. If the script is represented right-to-left the faces point to > the right so that you read into their faces. If the script direction > is reversed so that it is left-to-right, it is conventional -- among > Egyptologists and ancient Egyptians --
Re: Chromatic text, ligatures and Fraktur ligatures.
I know I said this before, but this time I'm serious. I will no longer respond publicly to any post concerning William Overington's proposed extensions of the kind of things that should be encoded in Unicode. That is because I am convinced now that his misinterpretation of the basic principles of Unicode, and the types of entities that do and do not make sense for encoding, is willful and not due to ignorance. Nobody with the intelligence of a tree could possibly read the character-glyph document and come away with the impression that font styles, sizes, colors, etc. are "central" to the notion of what belongs in character encoding. Intelligence is clearly not the problem here. But, because I am not an ad hominem kind of guy, I will be happy to discuss other topics related to (and appropriate to) Unicode that are raised by William or anyone else. In my next message, I want to address the "large corporate sponsor" angle that William, and others in the past, have used to argue that Unicode is unresponsive to the needs of low-end users. -Doug Ewell Fullerton, California
Re: Chromatic text. (follows from Re: [unicode] Re: FW: Inappropriate Proposals FAQ)
Marco Cimarosti wrote as follows. >Of course you can. But my feeling is that you already *did* suggest this, >many and many times. Actually I was trying in the posting upon which you comment to suggest that, even if people do not agree with me about having colour codes in a plain text file, they might perhaps consider as a separate issue the adding into regular Unicode of a zero width operator whose use would be to indicate that a character, such as U+1362, should be decorated chromatically. This would mean that a sequence U+1362 ZWJ ZWCDO could be used in documents, which would give a chromatically decorated glyph with a chromatic font yet would just give U+1362 as a monochrome character if the font did not recognize the U+1362 ZWJ ZWCDO sequence. > >I interpret your post as one more lengthy repetition of your well-known >opinion: differences between "plain text" and "rich text" should not exist: >they should be eliminated by incorporating the mark-up in the encoding. > Actually, that is not my opinion. My opinion is that splitting text files into just two categories, either plain text or markup is not sufficient, but that there should perhaps be more categories or, if there are but two categories that the dividing line between them should be in a different place. I tend to base the essential dividing line upon whether the encoding of the file of code points is meaningful if one tries to compute the effect of a code point upon the system as simply the effect of that code point as it stands, without having to have software recognize a character such as < and determine that a markup bubble is being entered then to have to read in several more characters within the markup bubble before taking any action as a result of the first character in the sequence (that is, the < character) being read. 
That distinction means that each Unicode character is processed as it is received within the main loop of the program, without the receiving of a < character putting the processing into an inner loop within a markup bubble, within which bubble ordinary Unicode character codes which are read have a different meaning than in the Unicode specification. To me, such a distinction means that people who are using lower cost, more generally available software packages, might by such an approach be able in the not too distant future to use files in a non-proprietary portable format and get much better results than just using monochrome traditional plain text. Perhaps some sort of consensus over nomenclature for three categories of text file could occur, namely plain text in the manner which you like it, plain text in the manner in which I like it and markup. Maybe plain text, enhanced text and markup would be suitable names. How do people feel about that please? It is unfortunately the case in discussions that when someone disagrees with an idea that is put forward that he or she is more likely to respond in public than if he or she agrees with an idea which is put forward, or has simply read about the idea and just notes it as an interesting possibility. This can have the effect that many people may agree with an idea or at least not be against it yet make no comment, perhaps giving an impression that an idea is not well received at large when in fact that is not necessarily the case. William Overington 8 July 2002
RE: Phaistos in ConScript
At 14:11 +0200 2002-07-08, Marco Cimarosti wrote: >Michael Everson wrote: >> A Unicode-enabled font based on the ConScript encoding and a test >> page containing the entire Phaistos corpus can be found at >> http://www.evertype.com/standards/csur/phaistos-sample.html. > >I have a few notes about the repertoire: > >1. Your lacks an important sign, which I would call "PHAISTOS >COMBINING LINE BELOW". This is the only handwritten sign on the disc; it is >not clear whether it is some kind of diacritic (e.g. a sort of virama) or a >punctuation sign. At any rate, it is clear that the sign has been >deliberately written under the last signs of some groups ("words"?). Um, can't something from General Punctuation be used, in the absence of knowing more about this "character"? Do you have an analysis of all the signs which take it in the document? >2. The last sign of the tenth group ("word"?) is almost totally >lost, due to a crack. However, it seems than none of the 45 known signs may >fit in the gap. Many scholars consider this to be a 46th sign. The glyph >normally used is the literature is a texture of diagonal lines. Godart says "The last sign of set A:VIII was not deleted but broke off with a sliver of clay. Bearing mind the space and outline of the gap, which seems to roughtly follow the outline of the broken sign, it seems that the most plausible identification of the mysterious sign is a 3 [TATTOOED HEAD] or a 20 [DOLIUM], unless it is an 8 [GAUNTLET] or a 4 [CAPTIVE], which is less likely." I don't want to encode a new character without better evidence (and wouldn't for ANY script). I haven't seen anything from other scholars who consider it a 46th sign. >... about the character names: > >3. The names for E6FE and E6FF ("PHAISTOS PARAGRAPH SEPARATOR" and >"PHAISTOS PHRASE SEPARATOR") show imprudent assumptions. E.g., many >people consider E6FF to be a paragraph or text separator, and E6FE >to be a word separator. 
It would be more prudent to use a more >generic wording, e.g. "PHAISTOS VERTICAL LINE" and "PHAISTOS >VERTICAL DOTTED LINE". I agree that those names aren't good. The dotted one occurs at the beginning of the text on both sides. PHAISTOS BEGINNING OF TEXT and PHAISTOS SEPARATOR then? I don't like VERTICAL LINE and DOTTED VERTICAL LINE very much. That kind of description we usually reserve for abstract technical symbols rather than punctuation. >4. Names such as "pedestrian", "plumed head", ... "wavy band" are >just nicknames used by scholars, as opposed to accepted identifications of >the objects represented. It may be worth to emphasize this in the character >names: e.g., "PHAISTOS SIGN KNOWN AS PEDESTRIAN". We either use the numbers given by the scholars, so U+E6D0 can either be called PHAISTOS SIGN-01 or PHAISTOS SIGN PEDESTRIAN. Either way we're using a scholarly designation. The meaningful nicknames are more fun than the numeric ones >... and about the Everson Phaistos font: > >5. I find that mirroring the signs as you did in your font is an >unhistorical. The whole corpus is right-to-left, and the fact that the signs >where impressed with types makes it impossible that the signs could have >been reversed. In academic books, it is common practice to type the disc's >text left-to-right, but the signs are not reversed. I have followed Egyptological -- and ancient Egyptian -- practice here. If the script is represented right-to-left the faces point to the right so that you read into their faces. If the script direction is reversed so that it is left-to-right, it is conventional -- among Egyptologists and ancient Egyptians -- to reverse the signs as well. Godart does not reverse the glyphs even though he reverses the directionality, but I think it is *his* practice which is ahistorical, and I think it makes the text harder to read. And I suspect is has to do with the font technology he had in 1994 when he wrote his book. 
>IMHO, the two characters in points 1 and 2 absolutely needed. Academic works >which consider them as part of the script could not be encoded without them, >while academic works which don't need them are not disturbed by their >existence in the encoding. I didn't think so. Any counter-arguments to the above? I suppose this discussion could be instructive to potential script-proposers out there... ;-) -- Michael Everson *** Everson Typography *** http://www.evertype.com
Sinhala Unicode
It was recently mentioned that there don't seem to be any Unicode fonts that include Sinhala. Wayne Albury recently drew my attention to Helawadana 2000, which claims to allow editing of Sinhala (and Tamil) in Windows applications using Unicode fonts. I have not tried it (it costs $99 and the links to order it don't work!). For more information, see: http://www.microimage.com/helawadana/ Alan Wood Documentation Writer / Web Master Context Limited (http://www.context.co.uk) mailto:[EMAIL PROTECTED] http://www.alanwood.net (Unicode, special characters, pesticide names)
Re: Chromatic text, ligatures and Fraktur ligatures
At 10:40 +0100 2002-07-08, William Overington wrote: >Michael Everson wrote as follows. > >>Your courtyard codes and your scientific chromatic explorations are >>not appropriate uses of the standard. With Quark XPress I can set my >>fonts to display in HUNDREDS OF THOUSANDS if not MILLIONS of colours, > . > >Courtyard codes and chromatic fonts are, in my opinion, entirely appropriate >uses of the standard. You would be wrong. >Recently I was referred to an ISO document about characters and glyphs, >ISO/IEC TR 15285. [...] Courtyard codes and codes for chromatic >fonts, in my opinion, fall within the definition of character in >Annex B of that document. Then you have not understood the definition, or you are twisting it to your own ends. The question is, are you twisting it because you really just don't get it, or are you doing this deliberately to waste our time and get some attention? Because it sure looks like one or the other at this point. >Courtyard codes also allow the use of millions of colours. There are 18 >codes for changing colour, 16 for specific colours and 2 for colour 98 and >colour 99 which can be set to any of those millions of colours using other >courtyard codes. This "technology" is useless because there are already solutions in use by REAL applications involving text markup. >Courtyard codes are, in my opinion, very important for the future of >broadcasting using the DVB-MHP system. They will enable Unicode text files >to carry colour and formatting information which can be straightforwardly >interpreted by a variety of relatively small Java programs from a variety of >content providers. There are other methods of carrying colour and formatting information which are already in use. It is called markup. >The advantages for the broadcasting of educational multimedia across >whole continents will be enormous if a consistent set of codes for >colours and basic formatting is widely used in a consistent manner. Doh! Look! They've already invented it! 
IT'S CALLED MARKUP. Woo-hoo! >Certainly if such a set were provided in plane 0 of regular Unicode >then that would be magnificent, yet in any case, that takes time and the >need to gain a consensus as to the use of a particular set of codes is now, >and courtyard codes are, as far as I am aware, the only set of codes >available to do the job at the present time. You've deluded yourself into thinking that this is the way it should be done. It isn't, and therefore Unicode will never contain such codes. Get it? You're wasting your time and ours. > >If you can't support Unicode on older >>systems then that's because the systems aren't good enough. > >Ah! A digital divide issue. You've misused the term "digital divide". It does not have to do with software versioning. >Windows 95 and Windows 98 systems, which are >not very old at all, cannot, as far as I am aware, support advanced font >technology such as OpenType. In addition, these advanced font technologies >are not part of the international standards and it seems to me that it is a >good thing for Unicode to provide facilities for advanced font usage, yet >quite another thing to start cutting off support routes for users of older >equipment, even when that equipment is only three years old. Tough. That's the nature of software development. You try to support older data, but you don't resort to hacks to simulate new technological abilities in old systems. You take it as read that people will have to upgrade their software, hardware, memory, or whatever. Advanced font technologies should not be part of international standards. That isn't what international standards are for. Unicode, as has been pointed out to you before, isn't an international standard, although its repertoire and architecture is identical with the repertoire of ISO/IEC 10646. > >Are PUA hacks to fix that a productive use of energy? One can't support > >everything in legacy data. 
> >You appear to be referring to my definition of the golden ligatures >collection. All of your PUA "work", actually, not just that particular one. >Well, first of all, I feel that the word "hack" is inappropriate. >The golden ligatures collection is a published list of Private Use >Area allocations. The documents clearly state what they are and >what they are not. It allocates code positions for ligatures when it is the stated intent of the standard not to do so. And it does so in order to provide some sort of bogus support for "older systems". I think "hack" is quite descriptive of what you are trying to achieve via character encoding as opposed to markup. >The fact of the matter is that people who vote on these matters, largely >only having a vote because they are the representatives of large >corporations, have decided that no more precomposed ligatures will be added >into Unicode. Because the ones that are already there are only to support legacy data, and they are not recommended for us
RE: Chromatic text. (follows from Re: [unicode] Re: FW: Inappropriate Proposals FAQ)
William Overington wrote: > >The problem (if there is one!) is only for font technology. > > > >> Ethiopian writing: [...] "The capability to the same electronically > >> would be well received. /Daniel." > > > >Same for this one: Unicode's task was to provide a code point for the > >Ethiopic full stop, and they did. Whether the corresponding glyph is > colored > >or not is problem for fonts and word processors. > > Well, may I please suggest that the issue is one for Unicode > as well as for font technology? > > [...] Of course you can. But my feeling is that you already *did* suggest this, many and many times. I interpret your post as one more lengthy repetition of your well-known opinion: differences between "plain text" and "rich text" should not exist: they should be eliminated by incorporating the mark-up in the encoding. I think that it is your right to repeat your opinions as many times as you want. Nevertheless, I find that repeating opinions which are already well-known to everybody is *useless* and *boring*. _ Marco
RE: Phaistos in ConScript
Michael Everson wrote: > A Unicode-enabled font based on the ConScript encoding and a test > page containing the entire Phaistos corpus can be found at > http://www.evertype.com/standards/csur/phaistos-sample.html. I have a few notes about the repertoire: 1. The repertoire lacks an important sign, which I would call "PHAISTOS COMBINING LINE BELOW". This is the only handwritten sign on the disc; it is not clear whether it is some kind of diacritic (e.g. a sort of virama) or a punctuation sign. At any rate, it is clear that the sign has been deliberately written under the last signs of some groups ("words"?). 2. The last sign of the tenth group ("word"?) is almost totally lost, due to a crack. However, it seems that none of the 45 known signs would fit in the gap. Many scholars consider this to be a 46th sign. The glyph normally used in the literature is a texture of diagonal lines. ... about the character names: 3. The names for E6FE and E6FF ("PHAISTOS PARAGRAPH SEPARATOR" and "PHAISTOS PHRASE SEPARATOR") show imprudent assumptions. E.g., many people consider E6FF to be a paragraph or text separator, and E6FE to be a word separator. It would be more prudent to use a more generic wording, e.g. "PHAISTOS VERTICAL LINE" and "PHAISTOS VERTICAL DOTTED LINE". 4. Names such as "pedestrian", "plumed head", ... "wavy band" are just nicknames used by scholars, as opposed to accepted identifications of the objects represented. It may be worth emphasizing this in the character names: e.g., "PHAISTOS SIGN KNOWN AS PEDESTRIAN". ... and about the Everson Phaistos font: 5. I find that mirroring the signs as you did in your font is unhistorical. The whole corpus is right-to-left, and the fact that the signs were impressed with types makes it impossible that the signs could have been reversed. In academic books, it is common practice to type the disc's text left-to-right, but the signs are not reversed. IMHO, the two characters in points 1 and 2 are absolutely needed. 
Academic works which consider them as part of the script could not be encoded without them, while academic works which don't need them are not disturbed by their existence in the encoding. _ Marco
longs
At 11:34 +0200 2002-07-08, Stefan Persson wrote: >- Original Message - >From: "John H. Jenkins" <[EMAIL PROTECTED]> >To: <[EMAIL PROTECTED]> >Sent: Monday, July 08, 2002 12:56 AM >Subject: Re: How do I encode HTML documents in old languages ſuch as 17th >century Swediſh in Unicode? > > >> On Wednesday, July 3, 2002, at 11:10 AM, Stefan Persson wrote: >> >> > There is a big problem in the current Unicode ſtandard, ſince >> > Fraktur letters aren't ſupported in any ſuitable manner. >> >> Aargh! Medial long-s! Run away! Run away! :-) > >Why ſhould I not uſe old characters that already were out-of-uſe centuries >ago? ;-) Becaufe it piffes people off? :-) -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: utf-8 and databases
> The primary concern is whether a database is able to represent the entire this was a question that came up about older middleware (cf5) that couldn't properly handle unicode, some folks (me included) were stuffing utf-8 into databases that didn't understand it (ie a char became a series of bytes). the question became, "are there any dbs that do"? and finally "how do you tell"?
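One way to approach the "how do you tell" question is to look for the typical symptom of UTF-8 stuffed into a store that treated it as single-byte text: each non-ASCII character comes back as two or more Latin-1 characters. The sketch below is only a heuristic under that assumption (that the bytes were read back as Latin-1); the function name `looks_byte_stuffed` is made up for illustration and is not part of any database API.

```python
def looks_byte_stuffed(stored: str) -> bool:
    """Heuristic: True if `stored` looks like UTF-8 bytes read back as Latin-1.

    Assumption for this sketch: the store treated each byte as one Latin-1
    character. Other single-byte code pages would need a different codec.
    """
    try:
        raw = stored.encode("latin-1")    # fails if any character is above U+00FF
        recovered = raw.decode("utf-8")   # fails if the bytes are not valid UTF-8
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    # Pure ASCII survives both steps unchanged, so it is not suspicious.
    return recovered != stored


original = "caf\u00e9"                                 # 4 characters
stuffed = original.encode("utf-8").decode("latin-1")   # what a naive store hands back

print(len(original), len(stuffed))
print(looks_byte_stuffed(original), looks_byte_stuffed(stuffed))
```

A string that fails this check is not proven clean; the check only catches the common case where one character has visibly become a series of bytes.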
Chromatic text, ligatures and Fraktur ligatures. (derives from Re: Chromatic text)
Michael Everson wrote as follows. >Your courtyard codes and your scientific chromatic explorations are >not appropriate uses of the standard. With Quark XPress I can set my >fonts to display in HUNDREDS OF THOUSANDS if not MILLIONS of colours, . Courtyard codes and chromatic fonts are, in my opinion, entirely appropriate uses of the standard. Recently I was referred to an ISO document about characters and glyphs, ISO/IEC TR 15285. This is available in a zipped format as follows. It unzips to a .pdf file. http://www.iso.ch/iso/en/ittf/PubliclyAvailableStandards/C027163e.zip Courtyard codes and codes for chromatic fonts, in my opinion, fall within the definition of character in Annex B of that document. This is not me finding some definition tucked away obscurely, it is central. The introduction section of the document states as follows. quote This Technical Report is written for a reader who is familiar with the work of SC 2 and SC 18. Readers without this background should first read Annex B, "Characters" and Annex C, "Glyphs". end quote Courtyard codes also allow the use of millions of colours. There are 18 codes for changing colour, 16 for specific colours and 2 for colour 98 and colour 99 which can be set to any of those millions of colours using other courtyard codes. Indeed, it is possible to use them with colours of more than 8 bits per colour channel so that they could be used for the high definition colour option of .png files if so desired. I may add a code into courtyard codes to signal that use option explicitly. Lots of programs can use millions of colours: expensive programs and widely available programs. It is part of modern computing. For example the Microsoft Paint program which can be used for preparing illustration files using a particular set of colours chosen from the millions of colours which the Paint program can be used to produce. 
There is an article about such a use in relation to preparing artwork for broadcasting upon the DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) system at the following address. http://www.users.globalnet.co.uk/~ngo/pai07000.htm Courtyard codes are, in my opinion, very important for the future of broadcasting using the DVB-MHP system. They will enable Unicode text files to carry colour and formatting information which can be straightforwardly interpreted by a variety of relatively small Java programs from a variety of content providers. The advantages for the broadcasting of educational multimedia across whole continents will be enormous if a consistent set of codes for colours and basic formatting is widely used in a consistent manner. Certainly if such a set were provided in plane 0 of regular Unicode then that would be magnificent, yet in any case, that takes time and the need to gain a consensus as to the use of a particular set of codes is now, and courtyard codes are, as far as I am aware, the only set of codes available to do the job at the present time. >If you can't support Unicode on older >systems then that's because the systems aren't good enough. Ah! A digital divide issue. Windows 95 and Windows 98 systems, which are not very old at all, cannot, as far as I am aware, support advanced font technology such as OpenType. In addition, these advanced font technologies are not part of the international standards and it seems to me that it is a good thing for Unicode to provide facilities for advanced font usage, yet quite another thing to start cutting off support routes for users of older equipment, even when that equipment is only three years old. > Are PUA >hacks to fix that a productive use of energy? One can't support >everything in legacy data. You appear to be referring to my definition of the golden ligatures collection. Well, first of all, I feel that the word "hack" is inappropriate. 
The golden ligatures collection is a published list of Private Use Area allocations. The documents clearly state what they are and what they are not. http://www.users.globalnet.co.uk/~ngo/golden.htm The fact of the matter is that people who vote on these matters, largely only having a vote because they are the representatives of large corporations, have decided that no more precomposed ligatures will be added into Unicode. I have accepted that that was the situation in which we find ourselves and that it is pointless seeking to get the decision changed, so I have settled for the fact that they have made the decision and I have published the golden ligatures collection and if the golden ligatures collection gets widely used, then good. Since you raise the matter, however, I do feel that adding U+FB07 as a ct ligature would be useful and, indeed, the golden ligatures collection is designed so that the chosen code points dovetail nicely with the code points of the U+FB.. block of regular Unicode: the issue seems more one of the politics of simply ignoring the needs of people who are not using the very l
Re: Re: How do I encode HTML documents in old languages ſuch as 17th century Swediſh in Unicode?
- Original Message - From: "John H. Jenkins" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Monday, July 08, 2002 12:56 AM Subject: Re: How do I encode HTML documents in old languages ſuch as 17th century Swediſh in Unicode? > On Wednesday, July 3, 2002, at 11:10 AM, Stefan Persson wrote: > > > There is a big problem in the current Unicode ſtandard, ſince > > Fraktur letters aren't ſupported in any ſuitable manner. > > Aargh! Medial long-s! Run away! Run away! :-) Why ſhould I not uſe old characters that already were out-of-uſe centuries ago? ;-) Stefan
Re: utf-8 and databases
Asmus is right that you shouldn't blithely assume that the encoding itself gives a performance advantage. However, I think this is more true when looking at software program efficiency than database efficiency. For example, some databases preallocate storage for records based on the fixed width of the record as n characters, and then allocate the maximum byte size of a character times n characters. So a 100-character record requires 400 bytes, even though much of the data might actually be only one- or two-byte characters. You can then see some large growth in utf-8 databases over utf-16 (where the utf-16 versions allocate 16 bits instead of the maximal 32 per character). Similarly, index keys are affected, and if the key size has a low limit, choosing one encoding over the other might give migration headaches. I think Asmus and I are both saying you are likely asking the wrong question. The encoding choice is a "don't care", since there is a 1-1 relationship and a simple, efficient algorithm for going between them. What you really want to ask of the vendor, and/or be testing for, is: given the kinds of data and operations you need to perform, how efficiently does the database use its storage facilities, retrieve the data, and execute the various operations (search, sort, etc.) for each encoding? hth tex Asmus Freytag wrote: > > At 02:11 PM 7/7/02 +0700, Paul Hastings wrote: > >is there a standard test that can determine whether a given > >database can handle utf-8 (ie as "native" utf-8 not converting > >to ucs-2 or whatever)? > > Why is that of any interest? > > The primary concern is whether a database is able to represent the entire > repertoire of Unicode. Just create a string that contains the largest > character 0x10FFFD, convert it to whatever encoding form the APIs require > and see whether you get it back unmolested. > > A more sophisticated test would take a longer string and attempt to sniff > out incorrect truncation of characters. 
> > A secondary concern is performance. If the choice of encoding form is a > poor match for the actual data encountered, and if entering and retrieving > the data requires too many transcoding steps, it's conceivable that this > could be detected in the overall performance of the database. > > However, there's no reason to assume that a theoretical match in encoding > efficiency translates automatically into a more efficient database > implementation. > Therefore, regular benchmarking tools should be fine to determine database > performance, as long as the test data is representative for the installation. > > A./ -- Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED] Xen Master http://www.i18nGuy.com XenCraft http://www.XenCraft.com Making e-Business Work Around the World
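The preallocation arithmetic in the message above (n characters times the maximum bytes per character) can be made concrete with a small sketch. This is only an illustration: the function names and the per-encoding maxima table are assumptions chosen to match the figures quoted in the message (4 bytes per character for utf-8, one 16-bit unit per character for utf-16), not the behaviour of any particular database.

```python
# Hypothetical sketch of fixed-width record preallocation.
# Assumption (from the message): the database reserves the maximum
# bytes per character for every character slot in the record.
MAX_BYTES_PER_CHAR = {
    "utf-8": 4,   # worst case for any Unicode character in UTF-8
    "utf-16": 2,  # one 16-bit unit per "character", as in the message
}

def preallocated_bytes(n_chars, encoding):
    """Bytes reserved for a record declared as n_chars characters."""
    return n_chars * MAX_BYTES_PER_CHAR[encoding]

def actual_bytes(text, encoding):
    """Bytes the text really needs in the given encoding (no BOM)."""
    codec = "utf-16-le" if encoding == "utf-16" else encoding
    return len(text.encode(codec))

if __name__ == "__main__":
    record = "a" * 100  # 100 ASCII characters
    for enc in ("utf-8", "utf-16"):
        print(enc, preallocated_bytes(100, enc), actual_bytes(record, enc))
```

For ASCII-heavy data the utf-8 record reserves four times what it uses, which is the growth described above; with data dominated by other scripts the ratio changes, so the comparison is only meaningful on representative data.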
Re: utf-8 and databases
At 02:11 PM 7/7/02 +0700, Paul Hastings wrote: >is there a standard test that can determine whether a given >database can handle utf-8 (ie as "native" utf-8 not converting >to ucs-2 or whatever)? Why is that of any interest? The primary concern is whether a database is able to represent the entire repertoire of Unicode. Just create a string that contains the largest character 0x10FFFD, convert it to whatever encoding form the APIs require and see whether you get it back unmolested. A more sophisticated test would take a longer string and attempt to sniff out incorrect truncation of characters. A secondary concern is performance. If the choice of encoding form is a poor match for the actual data encountered, and if entering and retrieving the data requires too many transcoding steps, it's conceivable that this could be detected in the overall performance of the database. However, there's no reason to assume that a theoretical match in encoding efficiency translates automatically into a more efficient database implementation. Therefore, regular benchmarking tools should be fine to determine database performance, as long as the test data is representative for the installation. A./
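A minimal sketch of the round-trip test described above, under stated assumptions: there is no real database here, so the "store" is any callable that takes a string and returns what comes back. The two faulty stores (`bmp_only_store`, `truncating_store`) are hypothetical stand-ins for the failure modes mentioned (loss of characters above U+FFFF and incorrect truncation); a real test would replace them with an insert followed by a select through the database's own API.

```python
LARGEST = "\U0010FFFD"  # the character named in the message

def round_trips(store, text):
    """True if the store returns exactly the text it was given."""
    return store(text) == text

def through_encoding(codec):
    """A well-behaved store: encode to the given form and decode back."""
    return lambda text: text.encode(codec).decode(codec)

def bmp_only_store(text):
    """Hypothetical faulty store: cannot hold characters above U+FFFF."""
    return "".join(ch if ord(ch) <= 0xFFFF else "\uFFFD" for ch in text)

def truncating_store(max_bytes):
    """Hypothetical faulty store: cuts the UTF-8 bytes at a fixed length,
    possibly in the middle of a multi-byte character."""
    return lambda text: text.encode("utf-8")[:max_bytes].decode(
        "utf-8", errors="replace")

if __name__ == "__main__":
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        print(codec, round_trips(through_encoding(codec), LARGEST))
    print("bmp-only", round_trips(bmp_only_store, LARGEST))
    # Longer string: the last character straddles the byte limit.
    print("truncating", round_trips(truncating_store(4), "ab" + LARGEST))
```

The single-character test catches a store that cannot represent the full repertoire; the longer string is needed to catch truncation, since a limit that happens to fall on a character boundary would go unnoticed.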