Re: Vertical BIDI
Philippe Verdy recently said: From: [EMAIL PROTECTED] What's uncertain is whether a lr or a rl progression is favored, given the paucity of evidence. Michael favors lr progression. There is no question that the text is read BTT. This creates an interesting problem: Put in the same sentence Han (Chinese) and Mongolian words in a vertical layout (I don't think this is unlikely, as Mongolian is also spoken in China, and there's also a Chinese community in Mongolia). So Chinese ideographs will be laid out vertically from top to bottom (but not rotated, except for a few characters like ideographic punctuation marks or symbols), and Mongolian will be laid out from bottom to top in their normal stack orientation. Such a text is clearly bidirectional, so we would need BiDi processing to order glyphs correctly. John's comment refers to Ogham. Mongolian goes top to bottom. Now try including some Latin words in this text (also not unlikely: there are lots of trademarks and people names that will need to be written with their normal Latin characters). If the text is presented vertically, there's a legitimate question of whether Latin should be rotated (but it will keep the Han flow direction.) Latin and Cyrillic are rotated 90 degrees clockwise when mixed with Mongolian in vertical lines. Presumably Arabic would be rotated 90 degrees anti-clockwise. (The ancestor of Mongolian was which is why the vertical lines go left to right.) One amusing aspect is that punctuation like ? and ! stay vertical at the end of Mongolian sentences, but are rotated at the end of Latin and Cyrillic ones. Mongolian is somewhat unusual in that nowadays when it is written in horizontal lines, it is rotated a further 90 degrees so it goes left to right and is upside down compared to the ancestral script. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: vertical direction control
Peter Kirk recently said: It seems strangely inconsistent to me that Unicode has detailed controls for horizontal layout direction and the complex bidi algorithm, but has nothing for vertical layout. I can force Latin text to be rendered right to left or Hebrew left to right (although such overrides are hardly plain text issues), but there is no way I can select vertical layout even for languages in which that is a normal way of writing. We already have U+202A LEFT-TO-RIGHT EMBEDDING and U+202B RIGHT-TO-LEFT EMBEDDING. It would be easy to define new characters TOP-TO-BOTTOM EMBEDDING and BOTTOM-TO-TOP EMBEDDING, with similar scope until the next PDF character. The difficult part would be implementing this, and before that defining the exact semantics (but Unicode could define the semantics as beyond its scope). (Another problem would be deciding which variant of mirrored characters e.g. brackets to use given that the context is neither RTL nor LTR - this is a problem with Egyptian hieroglyphs, many of which are mirrored in horizontal text.) For Egyptian hieroglyphs the characters generally face towards the start of the reading direction. (The occasional one is reversed, and sometimes whole texts face the wrong way.) So for horizontal l-to-r t-to-b face left, r-to-l t-to-b face right. For vertical t-to-b l-to-r face left, t-to-b r-to-l face right. In this case the fact that the inscription is top to bottom doesn't help - you need to know what the column arrangement is. You can even have both arrangements in one inscription, e.g. on either side of a doorway the figures face towards the door. (The bit over the door had the same arrangement as one of the sides rather than meeting halfway in the example I've seen.) IIRC it's like RLL R L R L Captions next to people in a larger picture usually face in the same direction as the person. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
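For readers who want to see the existing horizontal controls referred to above, here is a minimal Python sketch using the standard unicodedata module. It only looks up the three existing characters (U+202A, U+202B and the terminating U+202C); the helper names describe and embed_rtl are made up for illustration, and nothing here models the proposed vertical controls, which do not exist.

```python
import unicodedata

# The two embedding controls named in the message, plus the character
# that ends their scope (the "PDF" the message refers to).
CONTROLS = ["\u202A", "\u202B", "\u202C"]

def describe(ch):
    """Return (code point, character name, bidi class) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))

def embed_rtl(text):
    """Hypothetical helper: wrap text in a right-to-left embedding."""
    return "\u202B" + text + "\u202C"

for ch in CONTROLS:
    print(describe(ch))
```

The scope rule is visible in embed_rtl: the embedding opens with one control and runs until the matching terminator, which is the behaviour the message suggests copying for vertical text.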
Unwanted publicity?
I was somewhat surprised to see the word Unicode on page 8 of the Metro newspaper (London, UK) today (January 28, 2004). Unfortunately it was in the middle of an article about Mydoom, where it says The message may read 'The message contains Unicode characters and has been sent as a binary attachment.' This was the only one of the possible messages they quoted, presumably because it was the most distinctive. The name Unicode is now in mailboxes around the world - is this a good or bad thing? Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: What is a process?
Peter Kirk wrote: As there hasn't been a rush of on-list responses to this one, and partly in reply to the one off-list response, let me clarify the issue I have in mind. Instance A of a program P, version X, writes a Unicode character string S, in a particular normalisation form, to a storage medium Z. Some time later (maybe seconds, maybe years) instance B of version Y of that same program P reads that string from the same storage medium. For the purposes of Unicode conformance, are instances A and B to be considered one process or separate processes? I would say a process is something that carries out some sort of task on data. Typically data both comes in and goes out. It might be to the outside world or to a data store. Conformance clause C9 states that no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences, which implies that no process can assume that another process has correctly normalised any character sequence. So, if instances A and B are considered separate processes, B is not permitted to assume that the string S has been correctly normalised - even if in fact it is known that all strings on medium Z have been written by program P and that all versions of program P write strings in a particular normalisation form. I would consider A and B to be different versions of the same process. I read the word assume to mean make an assumption without definite knowledge. If process B *knows* something is true it can exploit that knowledge. If on the other hand it is receiving data from a process outside its control (owned by a third party perhaps) then it can't guess that the data have any particular characteristics. It is common for a process to be composed of sub-processes. If they can't exploit their knowledge of one another then you have serious problems. To take an extreme case how could you call a normalisation process if you couldn't rely on it returning normalised data?
Also, can the storage medium Z be considered a process? No it is a data store. Or can low-level transformations of the data, e.g. defragmentation, backup and compression, which are invisible to the program P be considered processes? If so, these processes are permitted to transform S into a canonically equivalent form; and so instance B of program P is not permitted to assume that the string it reads from Z is in the same normalisation form as the string written by instance A. At some point your system will make use of a data store. It is entitled to assume that what it gets out of the store is what was stored into it. The operating system might make invisible compressions or duplications, but the system using the data store is oblivious to them. If the operating system doesn't return what was put in then it doesn't qualify for an *invisible* change. I would expect the operating system documentation to make very clear if the storage routines don't return what you gave them in the first place. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
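To make the distinction in this message concrete, here is a small Python sketch using the standard unicodedata module. It shows two canonically equivalent strings that differ as code point sequences, and a reader that normalises data from a source it does not control. The function names same_text and read_untrusted are invented for the example, and the choice of NFC is an assumption; any of the normalisation forms would illustrate the same point.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings up to canonical equivalence (both taken to NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def read_untrusted(s, form="NFC"):
    """Normalise data received from a process outside our control."""
    return unicodedata.normalize(form, s)

composed = "\u00E9"      # e with acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)           # the code point sequences differ
print(same_text(composed, decomposed))  # but they are canonically equivalent
print(read_untrusted(decomposed) == composed)
```

A program that knows its own earlier instance wrote the data could skip read_untrusted; a program reading third-party data could not safely do so.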
Re: Punctuation symbols for partial cuneiform characters
John Cowan recently said: No, indeed. Even the hopelessly innumerate should be able to grasp the ceiling and floor functions, however: the floor of four and a half is four, whereas its ceiling is five. Some speak of rounding down and rounding up respectively. The hopelessly innumerate might get confused with minus four and a half. The floor is minus five and the ceiling is minus four. (The floor goes towards minus infinity not zero.) Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
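The point about negative numbers can be checked with a few lines of Python. This is only an illustration using the standard math module; math.trunc is included as the "round towards zero" operation that floor is being contrasted with.

```python
import math

def floor_ceil_trunc(x):
    """Return (floor, ceiling, truncation towards zero) of x."""
    return (math.floor(x), math.ceil(x), math.trunc(x))

for x in (4.5, -4.5):
    print(x, floor_ceil_trunc(x))
```

For 4.5 the floor and the truncation agree; for -4.5 they do not, which is where the confusion arises.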
Re: [Way OT] Beer measurements (was: Re: Handwritten EURO sign)
John Cowan recently said: Marco Cimarosti scripsit: You could generalize it a bit: Alignment Of Metric And Imperial Units Whose Difference Is So Small As To Be Pointless. E.g., I never understood why on earth metres and yards should be kept different. In a public park somewhere in UK or Ireland I have seen the following sign: Because the yard isn't just an isolated unit, like the pound in various European countries. It's part of a coherent (if profoundly messy) system. If we reduce the yard by 9%, the inch has to shrink too, and the last thing we want is to try to fit a 1/4 inch bolt (6.35 mm) into a nut whose inside diameter is only 5.81 mm. It's bad enough to have to have two kinds of hardware already: having incompatible things both labeled 1/4 inch would be the facilis descensus Averno indeed. In the UK the inch is now defined as 25.4mm rather than a subdivision of a standard yard kept under lock and key. If you peruse electronics catalogues you will discover that many components have leads spaced at a pitch of 2.54mm which seems a remarkable degree of accuracy. When I was younger they were a nice round 0.1 inch. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Encoding: Unicode Quarterly Newsletter
Ken recently said: Not to disagree publicly with Michael or Mark on this, but in the interests of accuracy, I should point out that if the rest mass of the Unicode 4.0 publication is assumed to be exactly 4.1 kg (which then would, indeed, also be the case on our moon, or even a Jovian moon), and ignoring any relativistic corrections for relative motion -- since it is unlikely that anyone will be reading the standard while it is moving at a significant fraction of the speed of light -- then we can calculate the weight as being *approximately* 9.05 pounds (avoirdupois) [or 10.99 troy pounds]. I think relative motion cannot be ignored. The subjective weight will be much higher if the book is dropped on the reader's foot. Perhaps it should have very soft covers. Will the book have on the back cover a list of the languages that can be written with Unicode, and if so, what type size will be used? Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Small Latin Letter m with Macron
John Hudson recently said: At 12:29 PM 1/16/2003, Timothy Partridge wrote: Charles Trice Martin wrote The Record Interpreter which lists words in record type and their expansion. The 2nd Edition (1910) has been reprinted many times. The 1999 reprint is a facsimile of the 1910 edition, rather than being re-typeset. The other standard text, which has the added benefit of being more international than _The Record Interpreter_, is Cappelli's _Lexicon abbreviaturarum_ . [snip] The abbreviated text in Cappelli is mostly handwritten (though in the introductory bits he does use 9 for a con sign and 2 for a round r!). I mentioned Martin because the abbreviation symbols are typeset. One challenge for representing abbreviations in plain text (as opposed to fancy) is the use of a superscript letter to mean "some letters, including this one, have been omitted here". Meaning is lost without the superscripts. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Small Latin Letter m with Macron
John Jenkins said: On Thursday, January 16, 2003, at 01:29 PM, Timothy Partridge wrote: Yes, especially early printing of Latin documents. See for example Gutenberg's bibles. Well, for that matter, even current editions of Spenser's _Faerie Queene_ will use the occasional õ for on, and so on. At least as late as the 1970s the English Statutes in Force had Magna Carta in abbreviated Latin with English translation. It dates from 1297. Quite a lot of it has survived. Much of the sections about the liberties of the forest have been repealed because you can't go around killing wildlife in forests these days. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Small Latin Letter m with Macron
Christoph Päper recently said: Kenneth Whistler: Christoph Päper asked: writing mm as only one m with a macron above. Handwritten forms and arbitrary manuscript abbreviations should not be encoded as characters. Although I've got no proof for it, I was told that it has also been used in print. Yes, especially early printing of Latin documents. See for example Gutenberg's bibles. In the nineteenth century, in England, many old handwritten records were printed in record type. This is like ordinary type but contains extra characters for the abbreviation marks. (It is in a typical serif font, not a handwriting style font.) I think the reason for reproducing in the condensed form rather than expanding the abbreviations, was that some abbreviations have more than one interpretation. For legal records an incorrect expansion can have a significant effect. The literal transcription reduces this risk. (It still requires someone to read the old handwriting correctly.) Charles Trice Martin wrote The Record Interpreter which lists words in record type and their expansion. The 2nd Edition (1910) has been reprinted many times. The 1999 reprint is a facsimile of the 1910 edition, rather than being re-typeset. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Mongolian Encoding
You recently said: On Mon, 16 Dec 2002 09:30:10 -0800 (PST), [EMAIL PROTECTED] wrote: I think that it is intended to use the equivalent Tibetan character sequences to produce the various types of Biruga, rather than MFVSs. Sounds eminently sensible and Unicode-like to use Tibetan symbols for Mongolian where appropriate. Is the following what you're suggesting ? 1st variant form = U+0F04 3rd variant form = U+0F04, U+0F05 4th variant form = U+0F04, U+0F05, U+0F05 Yes. It's just my suggestion though. We'll have to see what everyone else thinks. This does raise an issue over the rotated variant but that perhaps could become the standard glyph for the character in the Mongolian block. Is it possible to change the standard glyph for a character once it has been carved in stone on the Unicode code charts ? And if it were possible, then how would the horizontal form be represented ? There is no exactly corresponding form in the Tibetan block. Oops, I was reading my mail remotely and didn't have any books available and my memory failed me. You are quite right. I think we do need a variation selector for that variant. On the issue of glyphs, I think I am right in saying that Unicode doesn't standardise these. The ones in the code charts are just examples to aid in identifying the character. Font designers can do whatever they fancy, but if all their letter As came out looking like Bs they wouldn't be popular. The glyphs on the Mongolian code chart are especially unusual since some obscure variants have been picked to provide unique glyphs for the characters across the various sub-scripts. I would have thought that a keyboard for typing Sibe, say, would just have isolated / initial forms on the keys. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: In defense of Plane 14 language tags (long)
Doug Ewell recently said: 1. Language tags may be useful for display issues. Another use for language tagging is the correct formation of ligatures. E.g. fi ligature is fine in English, but causes problems in Turkish because of confusion with undotted i. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Variant selectors in Mongolian
Ken Whistler recently said: The value of the variant selector to the user is in knowing what the result is going to be, and this means that the variant form *must* be specified. It is. See above. How else can the variant selector be used to *select* a particular form? Selection implies a deliberate choice, not a willingness to accept any substitution a font might provide. I agree. Although variation selectors also imply willingness to accept fallback to default glyphs as legible alternatives, if not the desired alternatives. I'd like to suggest a particular example to clarify what you expect to happen. If the computer is asked to render toeroen (which is the penultimate word on page 547 of The World's Writing Systems, Daniels and Bright.), what do you expect the display to look like? I think the characters are U+1832 U+1825 U+1837 U+1825 U+1828 (I'm not sure about the n). My particular interest is the first U+1825. When there is no preceding vowel in the word, this character takes on a different medial form to distinguish it from the male U+1823. This is a normal behaviour of Mongolian. The different form is listed as being available with the use of a variant selector in the Unicode table. Would you expect the rendering software to spot there was no preceding vowel in the word and automatically select the correct medial glyph? Or would you expect the software to display the default medial glyph for U+1825 which looks like that for U+1823 and the user would have to include a variant selector 1 to achieve the desired result? Or to put it another way are the variant selectors rarely used (for unusual situations) or more frequently used for any situation where the default glyph in that position is not the desired one? I think this depends on whether the rendering software simply treats Mongolian as like Arabic with alternate glyphs available for selection, or has a deeper knowledge of the appearance of Mongolian.
I believe Unicode should take an explicit position on this as it has important implications for successful rendering of plain text on various platforms. (If the deeper knowledge position is taken, which I think is of significant benefit to the user, then the exact rules that are to be supported need to be stated.) The UNU/IIST report 170 takes a third position on the issue and in section 5 seems to misunderstand Unicode's distinction between characters and glyphs and suggests the input method selects appropriate characters including some from the PUA for presentation forms and ligatures. This appears to me to be akin to a web font trick. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
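The two encodings being contrasted in this message can be written out in Python as a rough sketch. The code points for the word are the ones suggested in the message (the author was unsure about the final n), and the selector used is U+180B MONGOLIAN FREE VARIATION SELECTOR ONE, on the assumption that this is what "variant selector 1" refers to. The helper names names and with_selector are made up for illustration, and nothing here says which of the two strings a renderer ought to require.

```python
import unicodedata

FVS1 = "\u180B"  # assumed to be the "variant selector 1" of the message

# Code points suggested in the message for the word "toeroen".
word = "\u1832\u1825\u1837\u1825\u1828"

def names(s):
    """List the Unicode character names in a string."""
    return [unicodedata.name(c) for c in s]

def with_selector(s, index, selector=FVS1):
    """Return s with a variation selector inserted after the character at index."""
    return s[: index + 1] + selector + s[index + 1 :]

# "Explicit" encoding: the user marks the first U+1825 by hand.
explicit = with_selector(word, 1)

print(names(word))
print([f"U+{ord(c):04X}" for c in explicit])
```

If rendering software has the "deeper knowledge" described above, the plain word would be enough; otherwise something like explicit would have to be typed or generated by an input method.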
Re: Saying characters out loud (derives from hash, pound, octothorpe?)
William Overington recently said: Still no olde worlde shoppe name with a yogh in though yet? :-) Why bother with an old one when there is a current shop with a yogh? Do you have a newsagent called Menzies in your part of England? (They have spread from Scotland.) That isn't a zed (or zee) in the name; it's a yogh. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
RE: Phaistos in ConScript
Marco recently said: 5. I find that mirroring the signs as you did in your font is unhistorical. The whole corpus is right-to-left, and the fact that the signs were impressed with types makes it impossible that the signs could have been reversed. In academic books, it is common practice to type the disc's text left-to-right, but the signs are not reversed. [Michael] I have followed Egyptological -- and ancient Egyptian -- practice here. If the script is represented right-to-left the faces point to the right so that you read into their faces. If the script direction is reversed so that it is left-to-right, it is conventional -- among Egyptologists and ancient Egyptians -- to reverse the signs as well. I see. But Hieroglyphs were handwritten, not typed. Moreover, the mirroring of glyphs is actually attested for Egyptian. Godart does not reverse the glyphs even though he reverses the directionality, but I think it is *his* practice which is ahistorical, and I think it makes the text harder to read. And I suspect it has to do with the font technology he had in 1994 when he wrote his book. It seems that July 2002 is our disagreement month... I think that Godart was perfectly right avoiding assumptions that he could not support: there is no reason to think that the Phaistos script should work as Egyptian hieroglyphs work. I would support you in this. Michael says that all the scripts in the region go both ways, but we don't even know that the disk is from the region. (And the headdresses apparently don't look local.) It might have come some way in trade. I feel tempted to protest that the characters aren't in the right order, but someone might take me up on that :-) I'm probably right though! [The reason I haven't replied directly to Michael's message is that something about his messages crashes my mail reader when I try it. Apologies to everyone for accidentally including a load of message headers last time I tried a workaround.] Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
RE: Inappropriate Proposals FAQ
Marco Cimarosti recently said: - No presentation glyphs for shapes that can already be obtained using regular characters in conjunction with ZWJ or ZWNJ. Why not just presentation glyphs in general? We seem to have queries about Indian conjuncts fairly frequently. Some more suggestions (some of which have been covered from other angles already):
- No scripts with a limited body of text in existence. (No need to exchange or analyse on computer.) E.g. Phaistos disk script
- No scripts which are poorly understood and it is not clear as to what the characters are. E.g. Rongo-rongo.
- No symbols that are just a picture of something with no other meaning e.g. a dog. (These tend not to have a fixed conventional form.)
- No symbols that are only used in diagrams rather than running text. e.g. electrical component symbols.
- No personal, idiosyncratic or company logos. E.g. the artist when he was not known as Prince.
- No archaic styles of existing characters. E.g. dotless j.
- No control codes for fancy text. E.g. begin bold
- No characters that can be obtained by using a different font with existing characters and have no semantic difference from the existing characters.
- No proposals to rename existing characters. (But a clarifying note might be added.)
- No proposals to reposition existing characters, e.g. so they sort better.
- No proposals for a newly invented character since putting it in the standard would help promote its use. (Significant usage must come first.)
Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Chromatic font research
Sampo Syreeni recently said: National flags are a far cry, true. Naval signalling ones perhaps aren't. They stand for characters and I believe in some variations for entire well-known concepts. They are utilized in a way we would expect characters to be. I don't think the entire collection of flags used around the world coincides neatly enough with an already encoded script to be considered pure glyph variants. And colors are certainly meaningful in this context. (I can't fathom why anyone would want to encode those, though. Anything you can do with flags you can do with ordinary characters, only more efficiently. However, this could serve as an example of a script which relies on color as an essential feature.) I'd agree that you wouldn't want to encode them, but you might want to make a font where each signaling flag is in the place of its corresponding character. That would be a use for chromatic fonts. The only other use that springs to mind is Egyptian hieroglyphics which have a colouring scheme when written in full colour. (Of course colour isn't *required* when reading them, it is just an aid that helps recognition.) As someone (Doug?) pointed out a little while back on another thread, fonts are (mis)used to hold collections of graphics conveniently. I imagine that if chromatic fonts were available this kind of usage would grow. It would also allow things like illuminated capitals to be put in a font rather than supplied as a collection of graphics files. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: 3 big bidi bugs
Bernard Miller recently said: This can be fixed by rewording step L2 such that a reversal happens from the highest embedding level to each lower contiguous embedding level, regardless if the embedding level is represented by a character on the line, until the embedding level of 1 is reached (or, as an optimization, until the first odd embedding level equal to or lower than the lowest embedding level represented by a character on the line). I had always interpreted L2 in the manner of your suggested correction, but perhaps the language could be clarified. (2) Line width dependent mangling, spelling conventions for quotes: What is the purpose of step X10 if not to allow something like LEFT DOUBLE QUOTATION MARK to be used as if it was an OPEN DOUBLE QUOTATION MARK? One simply puts an embedding inside a quotation, such as RLE quotation PDF. Surely if the quotation is meant to be right to left the RLE and PDF should be outside the entire thing, including the quotes. After all the intention is for the quotes to match the text is it not? (3) Mirroring ambiguities: What if eor = sor? text: R RLO whatever PDF N LRO whatever PDF embedding level at step X9: 1 3 3 1 2 2 directional type at step X10: R R R ? L L Have you perhaps misunderstood sor and eor? They are imaginary things inserted at the run boundaries, not a role undertaken by an actual character inside the run. For the above I make them as follows (s = sor, e = eor):

  text                        level at X9   s   type(s) at X10   e
  R                           1             R   R                R
  whatever (in RLO ... PDF)   3 3           R   R R              R
  N                           1             R   ?                L
  whatever (in LRO ... PDF)   2 2           L   L L              L

In particular at the start of the level 1 run in the middle the highest level on either side of the boundary is 3 so the direction of the sor (and the preceding eor) is R. At the end of the run the highest level is 2 so the eor is L as is that of the following sor. The Neutral has a conflict of directions surrounding it so it takes the embedding direction which is R. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
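The sor/eor reasoning in this message can be sketched in a few lines of Python. This is an illustration of the rule as described here (each boundary takes the direction of the higher of the two levels meeting at it, odd meaning R and even meaning L), not an implementation of the bidi algorithm. The function names are invented, the paragraph level of 1 is an assumption taken from the example, and the explicit formatting codes are assumed to have been removed already so that only the six levels remain.

```python
def direction(level):
    """Odd levels are right-to-left, even levels are left-to-right."""
    return "R" if level % 2 else "L"

def level_runs(levels):
    """Split levels into maximal runs of equal level: (start, end, level)."""
    runs = []
    start = 0
    for i in range(1, len(levels) + 1):
        if i == len(levels) or levels[i] != levels[start]:
            runs.append((start, i, levels[start]))
            start = i
    return runs

def sor_eor(levels, paragraph_level):
    """Return (level, sor, eor) for each run.

    At each run boundary the higher of the two adjoining levels decides
    the direction; the paragraph level stands in at the two ends of the text.
    """
    runs = level_runs(levels)
    result = []
    for k, (_, _, level) in enumerate(runs):
        before = runs[k - 1][2] if k > 0 else paragraph_level
        after = runs[k + 1][2] if k + 1 < len(runs) else paragraph_level
        result.append((level,
                       direction(max(level, before)),
                       direction(max(level, after))))
    return result

# Levels from the example: R, two characters at level 3, N, two at level 2.
for run in sor_eor([1, 3, 3, 1, 2, 2], paragraph_level=1):
    print(run)
```

The third run printed is the level 1 run holding the neutral: its sor is R and its eor is L, which is the conflict the message resolves by falling back to the embedding direction.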
RE: [OT] Re: The exact birthday of French: 0842-02-14
Elliotte Rusty Harold recently said: What's really needed to conclusively disprove this hypothesis is a verifiable event well in the middle of the problematic years that can be dated both backwards and forwards in time; i.e. that can be established as N years before the present and X years after the reign of one of the Caesars (or something similarly well-established.) Here event should be understood quite broadly to include not only battles, deaths of kings etc. but also buildings, coins, natural phenomena like comets and eclipses, etc. The test of a good hypothesis is its falsifiability, and that's true whether it's right or wrong or somewhere in-between. What distinguishes science from pseudo-science (and perhaps history from pseudo-history) is that pseudo-science is generally not falsifiable. I think this hypothesis is clearly falsifiable. Is there an astronomer in the house? A potential problem with lunar eclipses is that the cycle repeats every 18 and a bit years, and this has been known for a long time. So a really ingenious faker could have cut out an appropriate number of years. Seems a bit of a leap though to realise that eclipses could be used to verify dates. As for the number of days out of sync since Julius Caesar's time, I don't have the full details but the calendar had problems after Julius changed it. His Greek astronomer said leap years every four years. So they did. Unfortunately the Romans counted inclusively but the Greeks exclusively (like we do). So every four years to the Romans is what we would call every three years. It took them a while to realise. Augustus had a go at the calendar too. Pinched a day from February leaving it with just 28/29 (Julius gave it 29/30) and gave it to the month renamed after him (so it would be the same length as July). Would that cause a one day shift of the spring equinox too? Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Synthetic scripts
Doug Ewell recently said: The closest I can come is something like a script that was invented, generally by one person and in a relatively short period of time, rather than evolving from existing scripts in a gradual and progressive manner. But right away that definition includes not only Shavian, Tengwar, Cirth, Klingon, and most of the contents of ConScript, but also Ethiopic, Cherokee, Canadian Syllabics, Gothic, Deseret, and maybe Yi Syllabics, all of which are already encoded in Unicode. [snip] I still believe that separating writing systems into a natural or real category and an artificial or fictional or synthetic category is much less straightforward than those labels imply. If I went to a community whose language doesn't have a written form and convinced them that Tengwar would be an ideal way of recording their culture, would that make Tengwar more legitimate? Or cause people to regard it as a higher priority? Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: ISO 3166 (country codes) Maintenance Agency Web pages move
John Cowan recently said: Just how old are house numbers, anyway? Not the *concept* of numbering houses (which seems to be 18th century), but actual unaltered house numbers? Anyone know? I would imagine they are relatively stable. Property boundaries can be very long lived, since unless you own both properties either side of one, you can't change it. (Governmental interference can change things without the owners' consent of course.) A change of street name might be an opportunity for renumbering. I wonder how long 10 Downing Street, London has been around? Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
RE: UTF-17
Did anyone already propose an *UTF-17S*, where astral characters are encoded with a 16-byte sequence? Actually this would be ideal for my astrological database programmed in FORTH. UTF-16 sorting compatibility is essential for my application. Due to a five character file name limit I'll have to call UTF17S UTF17, but I'm sure this won't confuse any of my users. Tim Historical footnote: The FORTH language would have been called FOURTH (its creator felt it was fourth generation), but the OS it was written on had a limit of 5 letters for files.
Re: On the possibility of guidance code points for the Private Use Area
Peter recently said: William is certainly touching on an important issue: how does your software know how to interpret my PUA codepoints. I commend him for thinking about the issue, and his thinking outside the box. I don't think I or SIL would buy into his suggestion, however. The biggest flaw, which thoroughly undermines the ability of this system to work, is that your software has no way to actually know whether I'm following these conventions or not. Effectively, you're still dependent upon individual agreement between users as to the meaning of PUA codepoints. A good point. A possible workaround would be a new plane-14 tag character. But as Ken points out the world isn't complex enough yet to need a standardised way of describing how you're being non-standard. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
RE: Final letters in Hebrew and Arabic
James Agenbroad recently said: On Sat, 10 Mar 2001, Jonathan Rosenne wrote: Regarding Hebrew: -Original Message- From: Nick NICHOLAS [mailto:[EMAIL PROTECTED]] Sent: Friday, March 09, 2001 10:12 PM To: Unicode List Cc: Nick NICHOLAS Subject: Final letters in Hebrew and Arabic (1) When a letter with a final variant appears alone --- say as a numeral, or in discussion of the letter or phoneme --- does it under any circumstances appear in its final form, or is it always medial? Monday, March 12, 2001 When Hebrew letters are used as numbers, (probably not a current mainstream practice) the final forms of kaph, mem, nun, pe and ssadhe are used to represent 500, 600, 700, 800 and 900. My source: "Alphabete und Schriftzeichen des Morgen- und des Abendlandes." 2. Aufl. Berlin: Bundesdruckerei, 1969. Hence my use of German transliterated letter names. Use of medial forms would thus change the numeric value; this would also mean the final forms could appear in the middle of a number. Nakanishi (p. 32), Daniels and Bright, (p.490) and Van Ostermann (1952, p.120) only give numeric values for Hebrew letters through 400. I do not know if it is safe to infer from their silence that use of final forms for 500 to 900 is a seldom used twig of a seldom used branch. Gesenius' Hebrew Grammar Section 5k doesn't mention these. Instead it says a preceding taw is used to add an extra 400. It also says that thousands are sometimes denoted by two dots above the letter, e.g. aleph with two dots is one thousand. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
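The two conventions mentioned in this thread (final forms standing for 500 to 900, versus adding a taw for an extra 400) can be compared with a small Python sketch. The letter values 1 to 400 are the usual ones; the values for the final forms follow the source quoted above. The function name value and the flag finals_as_hundreds are invented for the example, the letters are written as escapes to keep the source left-to-right, and the two-dots notation for thousands is not modelled.

```python
# Units, tens and hundreds for the 22 ordinary (non-final) letters.
VALUES = {}
for i, ch in enumerate("\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5\u05D6\u05D7\u05D8", start=1):
    VALUES[ch] = i            # alef .. tet = 1 .. 9
for i, ch in enumerate("\u05D9\u05DB\u05DC\u05DE\u05E0\u05E1\u05E2\u05E4\u05E6", start=1):
    VALUES[ch] = 10 * i       # yod .. tsadi = 10 .. 90
for i, ch in enumerate("\u05E7\u05E8\u05E9\u05EA", start=1):
    VALUES[ch] = 100 * i      # qof .. tav = 100 .. 400

# Final forms: value under the 500-900 convention, and the ordinary letter.
FINALS = {
    "\u05DA": (500, "\u05DB"),  # final kaf
    "\u05DD": (600, "\u05DE"),  # final mem
    "\u05DF": (700, "\u05E0"),  # final nun
    "\u05E3": (800, "\u05E4"),  # final pe
    "\u05E5": (900, "\u05E6"),  # final tsadi
}

def value(numeral, finals_as_hundreds):
    """Sum the letter values of a numeral under one of the two conventions."""
    total = 0
    for ch in numeral:
        if ch in FINALS:
            hundreds, ordinary = FINALS[ch]
            total += hundreds if finals_as_hundreds else VALUES[ordinary]
        else:
            total += VALUES[ch]
    return total

print(value("\u05DA", finals_as_hundreds=True))         # final kaf alone
print(value("\u05EA\u05E7", finals_as_hundreds=False))  # tav + qof
```

Under the first convention a final kaf is 500, while treating it as an ordinary kaf gives 20, which is the change in numeric value the message warns about.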
Re: Klingon silliness
Tex Texin recently said: Perhaps the real question is what are the criteria for including or excluding a fictional script. I have deleted John's mail, but his criteria applied more broadly than Klingon if I recall. Should we worry about elvish communication and not Klingon? Do we apply a business case to fictional scripts and not to other scripts? Some of these scripts are in the PUA Conscript registry. Perhaps if a significant body of text using a PUA encoding built up and was used for interchange between many interested parties then it could be considered for promotion into the standard. On the subject of it taking hundreds of years to fill up the reserved space, and the lack of available characters perhaps the most likely event for filling it is contacting aliens! How come the Klingons only have one language and script? :-) (Or did one of the movies have a different collection of glyphs?) Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: FW: Greek questions
This isn't an official answer, but here goes.

First, rendering of U+03C3 "Greek small letter sigma". Is it allowed and/or encouraged for an application to render this code as a final sigma glyph when it occurs word-finally, or would this behavior be incorrect?

U+03C3 doesn't take contextual forms. Generally speaking the standard explicitly mentions contextual forms if shape changes are to take place. Hebrew is in a similar situation.

Why is 03C2 not given a compatibility decomposition of "final 03C3"?

No idea.

Assuming that it is legitimate (and indeed should be recommended) for 03C3 to be rendered contextually, should there be a separate code for "Greek sigma symbol" that would be used by mathematicians, etc., when the "normal" behavior of the letter sigma is not wanted?

As you point out contextual shaping would cause problems. There isn't another code AFAIK.

Section 2.6 "Combining characters" states that "Some specific combining characters override the default stacking behavior...", [snip] Is there a definitive list of the "specific" combining characters that should exhibit such exceptional behavior? Or are implementors left to discover the exceptions for themselves?

I'm not aware of a definitive list, and I agree one would be useful. I think Vietnamese and Hebrew are the only other ones. (That I can think of offhand.) Thai combiners keep a fixed distance from the base line, so although they stack they don't (need to) move.

Tim
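The sigma question can be checked against the character data with a short Python sketch using the standard unicodedata module. It shows that U+03C2 and U+03C3 are separate characters and that U+03C2 carries no decomposition. The helper spell_with_final_sigma is hypothetical: it changes the stored text, which is a different thing from a renderer choosing a glyph contextually.

```python
import unicodedata

SIGMA = "\u03C3"
FINAL_SIGMA = "\u03C2"

# The two sigmas are distinct encoded characters with their own names.
print(unicodedata.name(SIGMA))
print(unicodedata.name(FINAL_SIGMA))

# An empty string here means the character has no decomposition mapping.
print(repr(unicodedata.decomposition(FINAL_SIGMA)))


def spell_with_final_sigma(word):
    """Replace a word-final U+03C3 with U+03C2 in the text itself."""
    if word.endswith(SIGMA):
        return word[:-1] + FINAL_SIGMA
    return word


# A lone sigma used as a symbol would also be changed by such a rule,
# which is the problem raised in the thread.
print(spell_with_final_sigma("\u03BF\u03B4\u03BF\u03C3"))
```

Because the choice between the two characters is made in the text, a mathematician who wants a plain sigma simply stores U+03C3 and no contextual rule interferes.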
Re: Colours
William Overington [EMAIL PROTECTED] said: I am reminded of some pictures I once saw on collectable postcards. The pictures were reproductions from a medieval book, possibly, but I am not sure, The Très Riches Heures of the Duc de Berry, which is a famous manuscript book. Some of the numbers were black and some were red.

Red letter days are certain Holy days and Saints' days in the Christian calendar. Apparently the list was standardised by the Council of Nicaea in A.D. 325.

25th Jan Conversion of St Paul
2nd Feb Purification
24th Feb St Matthias
25th Mar Annunciation
Ash Wednesday
25th Apr St Mark
1st May St Philip and St James
Ascension Day
11th June St Barnabas
24th June St John the Baptist
29th June St Peter
25th July St James
18th Oct St Luke
28th Oct St Simon and St Jude
1st Nov All Saints
30th Nov St Andrew
21st Dec St Thomas

Dateless days in the above depend on the date of Easter.

Tim
RE: the Ethnologue
Peter Constable said: On 09/13/2000 12:04:24 PM "Ayers, Mike" wrote: What I'd really like to know is why there seems to be this insistence on only one official list of languages when there appears to be a clear need for two. There appears to be interest for a comprehensive, if imperfect, list on one hand, whereas other applications (web use, etc.) are interested in a fully researched list like RFC1766 provides. Why must these be the same list? Can't we acknowledge that it's going to take a long time to get everything right and work from two eventually converging lists? Just wonderin'... I have no problem with that whatsoever. Creating an alternate namespace mechanism with Ethnologue codes in a separate namespace seems to offer exactly what you describe. I'm wary of having two competing namespaces. As an alternative, I'd like to suggest something along the lines of en-cockney. Why not have iso-e-ethnologue as tags? This would be especially useful where there was just a miscellaneous ISO code. Applications could choose to parse just the ISO bit, or go for the full details. When extra languages are added to ISO, the tags would become out of date, but it would be relatively easy to identify which of the old tags needed updating. One potential snag is choosing which ISO tag would prefix a given Ethnologue tag. Perhaps SIL could give definitive opinions to avoid user divergence. Tim
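A minimal sketch of how an application might parse tags of the suggested iso-e-ethnologue shape, in Python. Everything here is an assumption for illustration: the proposal was never specified in detail, the name parse_tag is invented, and "xyz" is a placeholder rather than a real Ethnologue code.

```python
from typing import NamedTuple, Optional


class ParsedTag(NamedTuple):
    iso: str                   # the leading ISO language code
    ethnologue: Optional[str]  # Ethnologue code after the "e" marker, if any


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag of the assumed form ISO-e-ETHNOLOGUE into its two parts."""
    parts = tag.split("-")
    iso = parts[0].lower()
    # Assumed convention: a second subtag of "e" introduces the Ethnologue code.
    if len(parts) >= 3 and parts[1].lower() == "e":
        return ParsedTag(iso, parts[2].upper())
    return ParsedTag(iso, None)


print(parse_tag("mis-e-xyz"))      # miscellaneous ISO code refined by a placeholder code
print(parse_tag("en-cockney"))     # an ordinary tag has no Ethnologue part
print(parse_tag("mis-e-xyz").iso)  # an application may use just the ISO bit
```

An application that only understands ISO codes reads the first field and ignores the rest, which matches the "parse just the ISO bit" option described above.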
Re: Splitting lists
Sarasvati recently said: Munzir Taha wrote: I vote to your suggestion of opening a separate list. Recently there have been a few suggestions for dividing the list into separate lists. Unfortunately, Sarasvati runs a Benevolent Dictatorship, not an Athenian Democracy, and she believes in Bacchanalian co-educational experiences for all. I don't think we're ready for a touch of satyr. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: Looking For Information
Harry Aufderheide recently said: I work for a large global firm in the transportation industry and we are taking a high-level look at our future business requirements for and the I.S. effort to properly handle all the characters of all the languages currently in use on the planet earth. I have some specific questions but am interested in hearing anything related to work effort required, issues, concerns, etc. First some background. Our operating environment includes many IBM mainframes (multiple locations), AS/400s, UNIX platforms, various handheld data collection devices, and a large number of Windows NT clients and servers. Our applications run the gamut including data collection, customer focus internet, marketing, sales, financials, package tracking, billing, you name it we probably have it somewhere. Data for the most part is stored centrally on the IBM mainframes. Our programming languages also run the gamut including COBOL, C, C++, HTML, etc. We truly have an international presence but currently only receive data in English, French, Italian, German, and Spanish and, at least, some characters in other single byte languages. We are experiencing limited difficulties in properly handling all the single byte characters received. My belief is that this is due to program language character definition, code page, and EBCDIC/ASCII differences on the various platforms. We are now "putting out fires" while looking for a better single byte solution and future double byte requirements. Based on everything that I have read the UNICODE standard is the way to go; hence my questions. 1. Is the UTF-8's character set equal to the Latin-1 (ASCII) Code Page's? If not, what are the differences? Under the assumption that it is substantially the same; I don't see it solving our problems as we are currently processing more characters than this can support. It certainly doesn't appear a solution for handling Chinese, Japanese, etc.
This leads me to the UTF-16 format with its double byte capability.

2. I have read a good deal of material on support of UNICODE (UTF-x) on many platforms but have not found much about the mainframe (EBCDIC) environment other than DB2 support for UNICODE. Assuming that we will have the need to process characters that require double byte technology and assuming that we have already done a good job of internationalizing our applications I have an interest in this sort of information too.

The first question may be which versions of DB2 are in use. I think DB2 OS/400 supports CCSID 13488 UCS-2 Level 1 (UCS-2 is UTF-16 restricted to plane zero. It might manage UTF-16 too without too much effort.) I'm not sure whether DB2 on other platforms supports this CCSID.

UTF-16 is an encoding built from two-byte units, but I don't think that is quite the same as an IBM double byte character set (DBCS). I know very little about IBM DBCS, but the impression I have is that there are Shift In and Out control characters that swap between single and double byte modes. UTF-16 is modeless: plane-zero characters are always two bytes, and characters beyond plane zero take a pair of two-byte surrogates. Could an IBMer shed light on the following: Do IBM DBCS strings assume starting in single byte mode? And would the presence of certain bytes in UTF-16 trigger a switch from double to single byte mode?

IBM have defined UTF-EBCDIC. (Details available as a technical report on www.unicode.org) This converts Unicode characters into a variable number of bytes in a similar way that UTF-8 does. The basic letters A-Z and digits 0-9 are mapped to their corresponding EBCDIC codes. This means that when these particular characters are stored on an EBCDIC platform they are readable in that format. Other characters are mapped to sequences of non-control codes. This allows them to be shown on a terminal as weird looking sequences of characters, but ones which won't send any weird control codes to the terminal. Although UTF-EBCDIC exists I have not seen much sign of support for it.
For example, is it possible to print UTF-EBCDIC on a mainframe printer? Can any terminals show it? (Or terminal emulators on PCs.) At the moment UTF-EBCDIC seems to be of most use if you want to use the mainframe as a database server and translate into UTF-16 or UTF-8 when talking to the outside world. (A simple translation program would be needed.)

I see the need, across all platforms, for:
- redesigning many of our files

Extra length may be needed for some fields.

- making program changes specific to these physical changes (file layouts, working storage, user interfaces)
- modifying all logic operating on text (string) data

Sorting and string comparison can be complex (this is due to the complexities of people's sorting needs, not anything inherent in Unicode.)

Regards, Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer
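The differences between the encodings discussed in this thread can be seen in a small Python sketch that counts how many bytes each one needs per character. It assumes Python's built-in codecs, with cp037 standing in for a single-byte EBCDIC code page; UTF-EBCDIC is not in the standard library and is not shown. The function name byte_lengths is invented for illustration.

```python
CODECS = ("latin-1", "utf-8", "utf-16-be", "cp037")


def byte_lengths(text):
    """Map each codec name to the encoded length of text, or None if unencodable."""
    result = {}
    for codec in CODECS:
        try:
            result[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            result[codec] = None
    return result


samples = {
    "Latin capital A": "A",
    "e with acute": "\u00E9",
    "Han character": "\u4E2D",
    "first character beyond plane zero": "\U00010000",
}
for label, ch in samples.items():
    print(label, byte_lengths(ch))

# The same letter has different byte values in ASCII-based and EBCDIC code pages.
print("A".encode("latin-1"), "A".encode("cp037"))
```

The None entries show where a single-byte code page cannot represent a character at all, and the last sample shows UTF-16 needing four bytes (a surrogate pair) for a character outside plane zero, so field lengths should be sized in the chosen encoding's bytes rather than in characters.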