Re: Discrepancy between Names List Code Charts?
At 06:19 PM 15-08-02, James Kass wrote:

Does anyone know of a writing system which actually uses the Latin letter t with a bona-fide cedilla?

The newish Gagauz Turkish Latin-script orthography derives from both Turkish and Romanian models. This has led to a peculiar hybrid, in which the cedilla is used for the s and the comma accent is used for the t. If the Gagauz Turks became interested in stressing their Turkishness, they might decide that both s and t should use the cedilla, but I've not seen any examples of this yet. I don't know of any other languages for which the t-cedilla form might be appropriate, so I've always mapped both U+0163 and U+021B to the same t-comma-accent glyph.

John Hudson
Tiro Typeworks
www.tiro.com
Vancouver, BC
[EMAIL PROTECTED]

Language must belong to the Other -- to my linguistic community as a whole -- before it can belong to me, so that the self comes to its unique articulation in a medium which is always at some level indifferent to it. - Terry Eagleton
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
Kenneth Whistler wrote as follows about my idea.

It occurs to me that it is possible to introduce a convention, either as a matter included in the Unicode specification, or as just a known-about thing, that if one has a plain text Unicode file with a file name that has some particular extension (any ideas for something like .uof for Unicode object file)

...or to pick an extension, more or less at random, say .html

Well, that could produce confusion with a .html file used for HyperText Markup Language, HTML. I suggested .uof so that a .uof file would be known as being for this purpose.

that accompanies another plain text Unicode file which has a file name extension such as .txt, or indeed other choices except .uof (or whatever is chosen after discussion), then the convention could be that the .uof file has, on lines of text, in order, the name of the text file and then the names of the files which contain each object to which a U+FFFC character provides the anchor. For example, a file with a name such as story7.uof might have the following lines of text as its contents.

story7.txt
horse.gif
dog.gif
painting.jpg

This is a shaggy dog story, right?

No, it is a story about an artist who wanted to paint a picture of a horse and a picture of a dog and, since he knew that the horse and the dog were great friends and liked to be together and also that he only had one canvas upon which to paint, the artist painted a picture of a landscape with the horse and the dog in the foreground, thereby, as the saying goes, painting two birds on one canvas, http://www.users.globalnet.co.uk/~ngo/bird0001.htm in that he achieved two results by one activity. In addition the picture has various interesting details in the background, such as a windmill in a plain (or is that a windmill in a plain text file). 
:-)

The file story7.uof could thus be used with a file named story7.txt so as to indicate which objects were intended to be used for the three uses of U+FFFC in the file story7.txt, in the order in which they are to be used.

Or we could go even further, and specify that in the story7.html file, the three uses of those objects could be introduced with a very specific syntax that would not only indicate the order that they occur in, but could indicate the *exact* location one could obtain the objects -- either on one's own machine or even anywhere around the world via the Internet! And we could even include a mechanism for specifying the exact size that the object should be displayed. For example, we could use something like:

<img src="http://www.coteindustries.com/dogs/images/dogs4.jpg" width=380 height=260 border=1>

or

<img src="http://www.artofeurope.com/velasquez/vel2.jpg">

Now that is a good idea. In a .uof file specifically for the purpose, a line beginning with a < character could be used to indicate a web-based reference, or a local reference, for the object, using exactly the same format as is used in an HTML file. If the line does not start with a < character, then it is simply a file name in the same directory as the .uof file, as I suggested originally. This would mean that where, say, a .uof file were broadcast upon a telesoftware service, the Java program (also broadcast) analysing the file names in the .uof file need not necessarily be able to decode lines starting with a < character, so that the Java program does not need to have the software for that decoding in it, yet the same .uof file specification could be used, both in a telesoftware service and on the web, where a more comprehensive method of referencing objects were needed. I can imagine that such a widely used practice might be helpful in bridging the gap between being able to use a plain text file or maybe having to use some expensive wordprocessing package. 
And maybe someone will write cheaper software -- we could call it a browser -- that could even be distributed for free, so that people could make use of this convention for viewing objects correctly distributed with respect to the text they are embedded in.

Indeed, except not call it a browser, as the name is already in widespread use for HTML browsers and might cause confusion. Analysing a .uof file would be a much smaller computational task than analysing the complete syntax of HTML files.

Yes, yes, I think this is an idea which could fly. --Ken

Good. It is a solution which could be very useful for people writing programs in Java, Pascal and C and so on, which programs take in plain text files and process them for such purposes as producing a desktop publishing package. Hopefully the Unicode Technical Committee will be pleased to add a .uof format file specification into the set of Unicode documents so that the U+FFFC code can be used in an effective manner. The idea could be that if a .uof file is processed then the rules of .uof files apply in that situation, so that if a .uof file is not being processed, then the rules for .uof files do not apply, therefore
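A minimal Python sketch of the .uof convention as described above. The file names (story7.txt, horse.gif, dog.gif, painting.jpg) are the hypothetical ones from the example, and the pairing function is an illustrative assumption, not part of any actual proposal or standard:

```python
# Sketch of the hypothetical .uof convention discussed above: the first
# line of the .uof file names the text file, and each subsequent line
# names the object standing behind one U+FFFC anchor, in order.

OBJ = "\uFFFC"  # OBJECT REPLACEMENT CHARACTER

def read_uof(uof_lines):
    """Split a .uof file's lines into (text file name, object names)."""
    lines = [ln.strip() for ln in uof_lines if ln.strip()]
    return lines[0], lines[1:]

def pair_anchors(text, objects):
    """Pair each U+FFFC in the text with the next object name, in order."""
    pieces = text.split(OBJ)
    if len(pieces) - 1 != len(objects):
        raise ValueError("object count does not match U+FFFC count")
    return list(zip(pieces[:-1], objects))

uof = ["story7.txt", "horse.gif", "dog.gif", "painting.jpg"]
text_name, objects = read_uof(uof)
story = f"A horse{OBJ} and a dog{OBJ} on one canvas{OBJ}"
for preceding_text, obj in pair_anchors(story, objects):
    print(f"{preceding_text!r} -> {obj}")
```

As the replies below point out, this is exactly the job that HTML and a browser already do.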
Any day can be April 1st? (was: An idea for keeping U+FFFC usable)
From: William Overington [EMAIL PROTECTED]

Could this be discussed at the Unicode Technical Committee meeting next week please?

whoosh

William, Please read Ken's message again. He was *talking* about HTML, and pointing out how all of these things are supported in browsers already. You will likely be kicking yourself when you see what the message was actually saying. :-)

MichKa

Michael Kaplan
Trigeminal Software, Inc. -- http://www.trigeminal.com/
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
Yes, yes, I think this is an idea which could fly. --Ken

Good. It is a solution which could be very useful for people writing programs in Java, Pascal and C and so on, which programs take in plain text files and process them for such purposes as producing a desktop publishing package.

Uhh, I think Ken's message was entirely sarcasm, or some higher form of rhetorical humor whose obscure name slips my mind right now. The suggestion to use html as an extension was the giveaway - I was laughing out loud from that point on - his point was that the technology to do what you want already exists: it is called HTML, and it is displayed by browsers, and so forth.

Barry Caplan
www.i18n.com
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
William Overington wrote,

No, it is a story about an artist who wanted to paint a picture of a horse and a picture of a dog and, since he knew that the horse and the dog were great friends and liked to be together and also that he only had one canvas upon which to paint, the artist painted a picture of a landscape with the horse and the dog in the foreground, thereby, as the saying goes, painting two birds on one canvas, http://www.users.globalnet.co.uk/~ngo/bird0001.htm in that he achieved two results by one activity. In addition the picture has various interesting details in the background, such as a windmill in a plain (or is that a windmill in a plain text file). :-)

1) It's gif file format rather than plain text.*
2) There isn't any windmill.

Best regards,

James Kass,

* P.S. - But, it's a nice gif file. In fact, aside from the absence of the windmill, it exceeded my expectations. -JK.
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
William,

So let me see if I understand this correctly. Let's take 2 perfectly good standards, Unicode and HTML, and make some very minor tweaks to them, such as changing the meaning of U+FFFC and a special format for filenames in the beginning of the file and a new extension, so we have something new.

Now the big benefit of this completely new thing is that programs that do desktop publishing can use plain text files which are not quite plain text, because they have some special formatting, but now they can publish them in a better manner than before. For example, plain text with pictures. This is great. (It is true that it is less capable than if we had just used enough html to do the same thing, but .uof is more like plain text than html is.)

Programmers will be happy because now they can support plain text with just a few tweaks. Oh, I almost forgot, they also have to support Unicode, but slightly tweaked. And they can also support HTML, with some minor tweaks for .uof. Of course programmers don't mind supporting lots of variations of the same thing. Customer support personnel also don't mind.

Oh, the plain text programmers will now need to support pictures and other aspects of full publishing, but at least they won't have a complex file format to work with. I guess it doesn't matter that a more complex format is also more expressive and therefore can leverage all of the publishing features. It probably doesn't matter that a desktop publishing product probably already supports more complex formats, and probably also supports html; it will be beneficial to add this slight difference from plain text.

I like this very much. It is very much like when the magician slides the knot in the string and makes it disappear. I imagine that over time we will have some more wonderful inventions and add further tweaks and further improve the publishing of plain text. 
There are a few other things I would like to improve in Unicode, so I hope it will be ok to make some other suggestions. We can change the extension to know which tweaks we are talking about: .uo1, .uo2. Just a few small changes to characters and plain text format variations. Stability of the meaning of the file isn't important. However, I think my first suggestion will be to make the benefits of .uof available to XML. We can call this .uo1.

I am a little disconcerted that html already can do everything that .uof does plus more, and is also supported by all of the publishers that are likely to support .uof. Also, as there are more than a million characters in Unicode, most are unused so far, so changing the meaning of just FFFC in this one context doesn't seem like a big win, considering also that every line of code that might work with FFFC now needs to consider the context to determine its semantics. But every invention deserves to be implemented; we need not look at whether the invention satisfies some demand of its customers.

I like the 2 birds picture and I assume it was a metaphor for the idea - one bird was html, the other unicode. I was a little disappointed that you used html instead of .uof format though.

Maybe it's the lateness of the hour here. I hope the idea looks as good in the morning.

Oh, I almost forgot. I was having difficulty discerning when you and Ken might be joking. The mails read as very serious. I would like to suggest we make a new format, .uo2. We can indicate line numbers and emotions with plain text characters that look like facial expressions. It would help me know when you both were serious and when you might be joking. Sometimes it is hard to tell. I am going to create a list of facial expressions and assign them in the PUA so we can all have a standard to follow. See my next mail with a list of facial expressions and assignments.

tex

William Overington wrote: Kenneth Whistler wrote as follows about my idea. 
It occurs to me that it is possible to introduce a convention, either as a matter included in the Unicode specification, or as just a known about thing, that if one has a plain text Unicode file with a file name that has some particular extension (any ideas for something like .uof for Unicode object file) ...or to pick an extension, more or less at random, say .html Well, that could produce confusion with a .html file used for Hyper Text Markup Language, HTML. I suggested .uof so that a .uof file would be known as being for this purpose. that accompanies another plain text Unicode file which has a file name extension such as .txt, or indeed other choices except .uof (or whatever is chosen after discussion) then the convention could be that the .uof file has on lines of text, in order, the name of the text file then the names of the files which contains each object to which a U+FFFC character provides the anchor. For example, a file with a name such as story7.uof might have the following lines of text as its contents. story7.txt horse.gif dog.gif painting.jpg
Re: Furigana
On 08/14/2002 05:53:58 AM James Kass wrote:

Once a meaning like INTERLINEAR ANNOTATION ANCHOR has been assigned to a code point, any application which chooses to use that code point for any other purpose would be at fault.

Since it's for internal use only, nobody would ever know. Unicode conformance must always be understood in terms of what happens externally, between two processes, or between a process and a user. What goes on inside doesn't matter as long as it is conformant on the outside. If my program includes a portion of code that interprets all USVs as jelly-bean flavours but doesn't let any symptoms of that leak outside, I haven't violated any conformance requirement.

In other words, if these characters are to be used internally for Japanese Ruby (furigana), etc., then they ought to be able to be used externally, as well.

They simply aren't adequate for anything more than the simplest of cases. Moreover, the recommendations of TR#20 / the W3C character model clearly indicate that markup is to be preferred for applications like this.

Because it seems to be an oxymoron.

I think most would agree that that's clear now, but it wasn't always understood so clearly.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]
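Peter's point, that conformance is judged only at the interchange boundary, can be sketched in a few lines of Python. The function name and the choice to simply strip the internal-use characters at export time are illustrative assumptions; a real exporter would also have to decide what to do with the annotation text itself:

```python
# Sketch: Unicode conformance is judged at the interchange boundary.
# Internally a program may give code points any private interpretation
# (even jelly-bean flavours), provided none of it leaks outside. Here
# the interlinear annotation characters (U+FFF9..U+FFFB) and U+FFFC
# are dropped before text leaves the process.

INTERNAL_USE = {0xFFF9, 0xFFFA, 0xFFFB, 0xFFFC}

def export_text(s):
    """Strip internal-use code points before handing text to the outside."""
    return "".join(ch for ch in s if ord(ch) not in INTERNAL_USE)

inside = "\uFFF9base\uFFFAannotation\uFFFB and more text"
print(export_text(inside))
```

Dropping only the control characters is the simplest policy; a smarter exporter might instead keep the base text and discard the annotation runs entirely.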
Re: Tildes on vowels
On 08/13/2002 10:08:00 AM William Overington wrote:

I've been ignoring the list for a few days, but come back to find that not much has changed.

2) Superscript, subscript, combining above, and other forms of identifying placement of characters, are better left to markup or other rendering systems and file formats (and not for a vehicle intended for plain text.)

Why? This call for markup seems to be some deeply held belief that is treated as if it is a law of nature. So, some people somewhere decided to think in terms of layers, so, that is up to them: the fact of the matter is that using individual Private Use Area characters for matters which are otherwise performable by a sequence of characters starting with a character used to mean ENTER MARKUP BUBBLE rather than its specified meaning in the Unicode standard is perfectly reasonable.

While you make comments disparaging layers and markup, you don't seem to realise that your own solutions are actually equivalent. They simply replace the industry-standard and widely-adopted conventions of XML, using multi-character sequences like <sup>...</sup>, with single-character sequences using non-standard, *private*-use characters. Both solutions involve layers; both solutions use markup. The only differences are:

- one uses character sequences with start and end delimiters, while the other uses single characters with point-like effect (their scope is implicitly delimited)
- one is a widely-adopted industry standard that has a large number of implementations, and the other is merely a proposal entertained by a few individuals.

I'm sure someone has pointed this out already some while ago.

I am not knocking markup, I am simply saying that there is a choice of ways to do things and that sometimes a direct Private Use Area encoding is a good choice.

You'd better not be knocking markup, since you're simply introducing a different markup convention. Please recognise and acknowledge this. 
then Stefan's suggested characters might be very useful, particularly if they happen to be in a part of the Private Use Area not used for anything else

This sounds to me like complete nonsense! Everyone must assume that the *entire* PUA is used for something else by somebody. Those are the rules of the PUA.

Case in point: I have a use of the PUA that involves every single PUA codepoint, and it is entirely different from Stefan's suggested character and any other character you or anybody else on this list has ever (to my recollection) suggested for the PUA. It involves using PUA codepoints to stand for rational numbers in the sequence 0.5, 0.25, 0.125... 2^-125068. Name any PUA codepoint, and I can tell you what it represents in this private system of mine. (Valid use? Yes. Good use? Perhaps in some specific -- but as yet unidentified -- processing contexts, but generally, not really. Worth adopting by others? No.)

Or perhaps you mean, in a part of the PUA *I* haven't yet used. If you're meaning your own use of the PUA, then please say so, and don't speak in general terms that sound like there's one common use for the PUA.

That is true, yet I was not suggesting that. I am suggesting that within a specialised area of activity, namely transcribing documents and sharing the transcriptions with others who are aware of the technique being used, such a Private Use Area usage could be of value.

That's valid. But the discussion of specific uses of the PUA for those purposes really should be addressed specifically to a group of people that have such a need and wish to use a common convention so that they can interchange data amongst themselves.

In short, the proposals do not solve existing problems (1, 2, 3), conflict with the current architecture (4, 5), have problems themselves (5) and so are not enticing.

Well, perhaps this needs to be reconsidered in the light of the above comments. 
Reconsidering, the proposals are valid within a *private* group of users needing such a solution and needing to interchange data amongst themselves. If you try to expand the target group of users beyond that, then it is not good, for the reasons that were presented. So, both points of view are valid in relation to different contexts (one specific, the other more general) -- and only those contexts.

Indeed, in relation to the declared aims of this mailing list, I feel that discussion of Private Use Area uses in this list is directly on-topic.

The only problem is that when you talk about the PUA, you tend to express things in a way that makes it sound to others as though you mean for those proposed uses to apply to a wide group of users. Perhaps that's not what you intend, but I believe that's the way many perceive it. The very fact that you offer to the list to assign PUA characters and publish details when others suggest some idea for a private character contributes to this: such an offer isn't necessary since not everyone on this list is
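The every-single-PUA-codepoint example quoted above can be made concrete. The ordering of the three PUA ranges and the exact exponents below are assumptions, since the message gives only the start of the sequence (0.5, 0.25, 0.125...), and the point stands regardless of those details: any private convention may already occupy any PUA code point.

```python
# Sketch of an all-of-the-PUA private convention like the one described
# above: each PUA code point, taken in order across the three PUA
# ranges, stands for the next power of one half. The range ordering is
# an assumption on my part.

from fractions import Fraction

PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def pua_value(cp):
    """Map a PUA code point to 2^-(n+1), where n is its index in order."""
    n = 0
    for lo, hi in PUA_RANGES:
        if lo <= cp <= hi:
            return Fraction(1, 2) ** (n + cp - lo + 1)
        n += hi - lo + 1
    raise ValueError("not a PUA code point")

print(pua_value(0xE000))  # Fraction(1, 2)
print(pua_value(0xE002))  # Fraction(1, 8)
```

Name any PUA code point and the function tells you what it "represents" in this private system, which is exactly why no one else should assume any PUA code point is free.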
Re: Double Macrons on gh (was Re: Tildes on Vowels)
On 08/14/2002 02:36:37 PM William Overington wrote:

U+0360 COMBINING DOUBLE TILDE
U+035D COMBINING DOUBLE BREVE
U+035E COMBINING DOUBLE MACRON
U+035F COMBINING DOUBLE LOW LINE

I also note U+0361 COMBINING DOUBLE INVERTED BREVE and U+0362 COMBINING DOUBLE RIGHTWARDS ARROW BELOW in the code chart. I wonder if someone could please clarify how an advanced format font would be expected to use such codes.

In a dumb font, support for these characters can be implemented by having a glyph that has zero advance width, with the outline extending beyond both side-bearings.

In a smart font, one could position the glyph for one of these combining marks using attachment points (i.e. the outline of the glyph for the base character includes a target point, and the outline for the combining mark includes a specific point that the layout engine aligns over the target point), or one could look for certain base + combining mark combinations and substitute the sequence of glyphs for a single composite glyph. The latter approach has limitations in that you have to choose ahead of time exactly which combinations you will support, and there can only be a limited number of such combinations. Attachment points, in general, have the advantage that they can be designed to work with arbitrary combinations -- any possible combination.

With the double-width combining marks, though, things are rather trickier. First, you may need to substitute a variant glyph for the combining mark that has a width to match the particular pair of base characters -- potentially quite messy; and then you have to deal with positioning in relation to two base characters at once, which has additional complexity. For instance, when positioning a double macron over (say) la, you need to adjust the height to the taller of the two glyphs; but you need to make the same adjustment for al. 
One of my co-workers implemented such behaviour in a font using Graphite a couple of years ago; my recollection is that there isn't an easy way to accomplish this with OT, but I haven't worked with OT enough to know for sure.

I understand from an earlier posting in this thread that the format to use in a Unicode plain text file would be as follows: first letter, then combining double accent, then second letter.

Yes.

As first letter and second letter could theoretically be almost any other Unicode characters, would the approach be to just place all three glyphs superimposed onto the screen and hope that the visual effect is reasonable

That's one possibility, what I would refer to as the dumb rendering implementation.

or would a font have a special glyph within it for each of the permutations of three characters which the font designer thought might reasonably occur, yet default to a superimposing of three glyphs for any unexpected permutation which arises?

This is a possible implementation in a smart-font rendering context.

As a matter of interest, how many characters are there where such double accents are likely to be used please? Is it just a few or lots?

This really isn't easy to answer. Someone could tell you, these 29 combinations... but they might not -- probably do not -- know about what every user in the world might have ever needed or will ever need.

While in this general area, could someone possibly say something about how and why U+034F COMBINING GRAPHEME JOINER is used please?

Please read the relevant portions of the standard (see section 13.2 in clause IV of TR#28), and then come back with questions for clarification, if needed.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]
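The plain-text sequence under discussion (first letter, double diacritic, second letter) is easy to detect mechanically. A hypothetical first-pass scanner of the kind a rendering layer might run before choosing between the dumb and smart strategies, with the set of marks taken from the code points quoted above:

```python
# Sketch: find the base + double diacritic + base triples that a
# renderer (dumb zero-advance-width font or smart attachment-point
# font) must treat as one visual unit. The scanner itself is an
# illustration, not any shipping engine's logic.

import unicodedata

DOUBLE_MARKS = {0x035D, 0x035E, 0x035F, 0x0360, 0x0361, 0x0362}

def double_mark_triples(s):
    """Yield (first letter, mark, second letter) for each double mark."""
    for i, ch in enumerate(s):
        if ord(ch) in DOUBLE_MARKS and 0 < i < len(s) - 1:
            yield (s[i - 1], ch, s[i + 1])

text = "g\u035Eh"  # g + COMBINING DOUBLE MACRON + h
for first, mark, second in double_mark_triples(text):
    print(first, unicodedata.name(mark), second)
```

A smart font would use such triples to pick a variant mark glyph whose width suits that particular pair of bases; a dumb font simply overstrikes.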
RE: Furigana
On 08/14/2002 10:52:32 AM Michael Everson wrote:

I'm saying I WANT to use these characters. They solve an apparent need of mine

They only *appear* to you to solve that need, but in fact do not offer a good solution. Markup is recommended for your need.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
On 08/14/2002 02:04:50 PM William Overington wrote:

As this concerns the U+FFFC character and the Unicode Technical Committee is due to meet next week, I think it might be helpful if this idea is discussed before the meeting, as a straightforward idea like this might mean that the possibility to exchange U+FFFC characters at all, if people want to do so, is not lost.

This does not solve any problems not already solved. This is not plain text; it is a form of interchange markup and a higher-level protocol. There are already higher-level markup protocols that accomplish this. The standard already specifies that FFFC should not be exported from an application or interchanged. There is no reason to change this.

Everybody will welcome the new conventional, graphical-type characters and scripts that are coming with Unicode 4.0. What are those please?

See the Proposed characters section of the Unicode site.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]
Re: Double Macrons on gh (was Re: Tildes on Vowels)
On 08/14/2002 04:34:27 PM Doug Ewell wrote:

Broad ranges of Planes 0 and 1 have been tentatively blocked out on the Roadmap for RTL scripts.

Oh? I was somewhat sharply rebuked a few years ago for suggesting that such a thing be done. References to relevant documentation, please?

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]
Re: RE: Furigana
On 08/14/2002 01:16:29 AM starner wrote:

That seems to be basically what William Overington is proposing, except these characters only handle furigana, instead of all markup.

Not quite. WO has proposed characters to be used in interchange. These are only intended for internal use by programmers. They are exactly like the non-characters at FDD0..FDEF, except that these were named for a specific function (as was FFFC -- also an internal-use code with a specifically-named function).

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]
Re: Double Macrons on gh (was Re: Tildes on Vowels)
At 09:38 +0100 2002-08-16, [EMAIL PROTECTED] wrote:

On 08/14/2002 04:34:27 PM Doug Ewell wrote:

Broad ranges of Planes 0 and 1 have been tentatively blocked out on the Roadmap for RTL scripts.

Oh? I was somewhat sharply rebuked a few years ago for suggesting that such a thing be done. References to relevant documentation, please?

We kept like with like in the Roadmap. Nobody rebuked us.
--
Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Gutenberg's ligatures (spins off from Re: Tildes on vowels)
Michael Everson wrote,

Appropriate font technology for Latin ligature display exists, but it isn't enabled yet in Microsoft's Uniscribe.*

That doesn't mean that this particular cataloguing of ligatures in the PUA is a good idea.

The Golden Ligatures Collection simply offers font developers and end users an opportunity to make use of some rather interesting ligatures in a consistent, although non-standard, fashion.

That doesn't make it a good idea.

From the Adobe Glyph List at http://partners.adobe.com/asn/developer/type/glyphlist.txt

quote
# 1.0 [17 Jul 1997] Original version
#
0041;A;LATIN CAPITAL LETTER A
00C6;AE;LATIN CAPITAL LETTER AE
01FC;AEacute;LATIN CAPITAL LETTER AE WITH ACUTE
F7E6;AEsmall;LATIN SMALL CAPITAL LETTER AE
00C1;Aacute;LATIN CAPITAL LETTER A WITH ACUTE
F7E1;Aacutesmall;LATIN SMALL CAPITAL LETTER A WITH ACUTE
...
/quote

Small caps get assigned in the PUA in published lists, why not other presentation forms, too? Plenty of precedent exists. This may not be a good idea from an encoding standpoint, but, right now, this is a display issue. OpenType technology should eventually enable variants to display even when correct text encoding is used, but it doesn't work yet.

The Cardo font has presentation forms in the PUA area and so do the Junicode and Code2000 fonts. Lots of fonts do. As a font designer, you probably can understand a desire to be able to display a glyph once it is drawn. If a designer puts a glyph in a font without providing a user with any way to display the glyph, the designer might as well not have troubled.

Best regards,

James Kass.
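The glyph-list excerpt quoted above has a simple line format: code point, glyph name, and Unicode character name, separated by semicolons, with comment lines starting with '#'. A small parser sketch; the PUA-flagging step at the end is my illustrative addition, not part of the list format itself:

```python
# Sketch: parse glyph-list lines of the shape quoted above and flag
# which glyph names are assigned in the BMP Private Use Area.

AGL_EXCERPT = """\
# 1.0 [17 Jul 1997] Original version
0041;A;LATIN CAPITAL LETTER A
00C6;AE;LATIN CAPITAL LETTER AE
F7E6;AEsmall;LATIN SMALL CAPITAL LETTER AE
"""

def parse_glyph_list(text):
    """Return {glyph name: code point} from 'code;name;character name' lines."""
    table = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        code, name, _uniname = line.split(";")
        table[name] = int(code, 16)
    return table

table = parse_glyph_list(AGL_EXCERPT)
pua = {n: cp for n, cp in table.items() if 0xE000 <= cp <= 0xF8FF}
print(sorted(pua))  # the small-cap presentation form lives in the PUA
```

Running this over the full list would show exactly the precedent James describes: presentation forms such as AEsmall sitting on PUA code points.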
Re: New version of TR29:
Mark Davis wrote:

There is a new version of Unicode Technical Report #29: Text Boundaries on http://www.unicode.org/reports/tr29/, covering grapheme-cluster, word and sentence boundaries. There are significant modifications to this version; for a summary, see http://www.unicode.org/reports/tr29/#Modifications. This is a draft version, not a final version. There are a number of open issues remaining. Feedback is welcome. Feedback that is received before the UTC meeting (starting August 20) can be made available for the discussion of TR29 at that meeting.

FYI: There's an open issue regarding grapheme-cluster boundaries in Thai.

* SARA AM as an Other_Grapheme_Extend?

Should 0E33;THAI CHARACTER SARA AM be a GraphemeExtend character or not? By Unicode definition, SARA AM is an Lo, not a combining character. But many Thai applications (MS Office/ Windows/ OpenOffice.org) treat SARA AM like a combining character (unlike SARA AA), i.e. the cursor always jumps over it. Whether this is right or not is controversial, but the fact is that Windows users are used to it. My personal question is: if it is favorable for Thai to treat SARA AM as part of the previous grapheme cluster, is it possible for the UTC to consider adding SARA AM as an Other_Grapheme_Extend?

---

I also notice that Grapheme_Link is removed from the grapheme-cluster definition. This is appropriate for Thai because PHINTHU should not cause two grapheme clusters to be linked together.

--
Feel free to disclose the contents of this message.

Regards,

Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
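The cursor-movement behaviour described above can be illustrated with a naive segmenter. This is emphatically not the UAX #29 algorithm; the function and its Mn-marks-attach rule are simplifying assumptions, used only to show what changes if SARA AM is treated as extending the previous cluster:

```python
# Sketch: U+0E33 THAI CHARACTER SARA AM is General_Category Lo, not a
# combining mark, yet many Thai applications let the cursor jump over
# it as if it extended the previous cluster. The flag below toggles
# between the two behaviours.

import unicodedata

SARA_AM = "\u0E33"

def naive_clusters(s, sara_am_extends=True):
    """Group characters into clusters: Mn marks (and optionally SARA AM)
    attach to the preceding character."""
    clusters = []
    for ch in s:
        extend = unicodedata.category(ch) == "Mn" or (
            sara_am_extends and ch == SARA_AM)
        if clusters and extend:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "\u0E19\u0E33"  # THAI CHARACTER NO NU + SARA AM
print(len(naive_clusters(word, sara_am_extends=True)))   # 1
print(len(naive_clusters(word, sara_am_extends=False)))  # 2
```

With the flag on, the two-character word behaves as one cursor unit, which is the behaviour Windows users are used to; with it off, the default Lo classification yields two units.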
Re: Furigana
[EMAIL PROTECTED] wrote:

On 08/14/2002 12:45:22 AM Kenneth Whistler wrote:

But even at the time, as the record of the deliberations would show, if we had a more perfect record, the proponents were clear that the interlinear annotation characters were to solve an internal anchor point representation problem.

I recall at the UTC meeting in Jan 2000 (I think it was 2000) there was discussion of adding non-character code points for internal use by programmers, and I remember Tex suggesting that it might be better to identify the specific functions for which internal-use codepoints might be needed, as had been done in the case of things like the IA characters. In other words, at that time, it seems that they were understood by everyone present to be intended for internal use by programmers only.

Peter's made the point that "for internal use" was understood, which is fine. Let me add that my concern with internal-use code points not having specific functions is that we now live in a world where software applications often use third party components (various drivers, shared libraries, OCXs, DLLs, etc.) internally. Having internal-use code points which may not be treated with the right semantics by 3rd parties that have been integrated with internally is problematic. You should be careful and avoid passing these internal-use code points to third parties, but this greatly inhibits their use, or makes for an awkward and not easily extensible architecture.

At the time (in the discussion), I don't think we had many examples of what the uses would be, and it wasn't clear that many were needed, since the functionality could be arrived at with higher level protocols.

So to be clear, when internal-use code points are used, not only do they need to be filtered from external exchanges, you need to be very clear about your internal architecture and make sure you don't call a system function or third party function that might mistreat the internal-use code point or, worse, barf at it. 
(Anyway, I think that's what I was thinking at the time. I have trouble remembering what I said yesterday, much less in the last millennium.)

tex
--
Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED]
Xen Master http://www.i18nGuy.com
XenCraft http://www.XenCraft.com
Making e-Business Work Around the World
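Tex's caution about third-party components suggests a defensive check at every internal boundary where text is handed to code you do not control. The function name and the exact sets filtered are illustrative assumptions; the point is that the check must happen before the call, not after:

```python
# Sketch of the defensive boundary check described above: before handing
# a string to a third-party component (driver, shared library, DLL),
# verify it carries no noncharacters or internal-use code points that
# the component might mistreat or, worse, barf at.

NONCHARACTERS = set(range(0xFDD0, 0xFDF0)) | {0xFFFE, 0xFFFF}
INTERNAL_USE = {0xFFF9, 0xFFFA, 0xFFFB, 0xFFFC}

def check_interchange_safe(s):
    """Return s unchanged, or raise if it is unsafe to pass outside."""
    for i, ch in enumerate(s):
        if ord(ch) in NONCHARACTERS or ord(ch) in INTERNAL_USE:
            raise ValueError(
                f"internal-use code point U+{ord(ch):04X} at index {i}")
    return s

print(check_interchange_safe("plain text"))
try:
    check_interchange_safe("oops\uFFFC")
except ValueError as e:
    print(e)
```

The cost Tex describes is visible here: every call site into third-party code needs either this check or a guarantee that the internal-use code points were already stripped, which is exactly what makes such architectures awkward to extend.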
Re: The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)
Kenneth Whistler replied to my posting as follows. An interesting point for consideration is as to whether the following sequence is permitted in interchanged documents. U+FFF9 U+FFFC U+FFFA Temperature variation with time. U+FFFB That is, the annotated text is an object replacement character and the annotation is a caption for a graphic. Yes, permitted. Great. That may well be useful for free to the end user distance education using telesoftware upon digital television channels. A .uof file (as in the thread An idea for keeping U+FFFC usable. ) could be used with a Unicode plain text file of some learning material over the broadcast link and a Java program (also broadcast) could place the pictures with their captions in the correct place in the text. As would also be: U+FFF9 U+FFFC U+FFFC U+FFFA U+FFF9 Temperature U+FFFA a measure of hotness, related to the U+FFF9 kinetic energy U+FFFA energy of motion U+FFFB of molecules of a substance U+FFFB U+FFF9 variation U+FFFA rate of change U+FFFB with time U+FFFC . U+FFFB Where the first U+FFFC is associated with a URL with a realtime data feed, the second U+FFFC is a jar file for a 3-dimensional dynamic display algorithm, and the third U+FFFC is a banner ad for Swatch watches. Thank you for this example. I have analysed it thoroughly using Notepad by going to a new line and indenting at each occurrence of U+FFF9 and going to a new line and indenting at each occurrence of U+FFFA, and going to a new line and placing each U+FFFB beneath the corresponding U+FFFA. For each U+FFFC I went to a new line, and placed the U+FFFC beneath the most recent U+FFF9 or U+FFFA character. In addition, after each U+FFF. character, for ordinary text, I went to a new line and indented so that the next ordinary text character was beneath the U of the most recently entered U+FFF. character, except that after a U+FFFB the indentation went back two indentation levels. 
After each U+FFFC character, and on the same line, I added the details of the object within parentheses. This gave the following. U+FFF9 U+FFFC (URL with a realtime data feed) U+FFFC (jar file for a 3-dimensional dynamic display algorithm) U+FFFA U+FFF9 Temperature U+FFFA a measure of hotness, related to the U+FFF9 kinetic energy U+FFFA energy of motion U+FFFB of molecules of a substance U+FFFB U+FFF9 variation U+FFFA rate of change U+FFFB with time U+FFFC (banner ad for Swatch watches) . U+FFFB This took me quite some time to figure out, and was indeed an interesting challenge. It seems to me that if that is indeed permissible that it could potentially be a useful facility. I was referring to my original example, not to your example! :-) Permissible does not imply useful, however, in this case. That's referring to your example when you refer to this case is it? :-) It is unlikely that you are going to have access to software that would unscramble such layering in purported plain text, even if you had agreements with your receivers. Hmm? Yet, it is not the example to which I referred. The example to which I referred has not been commented upon as to its practical feasibility has it? However, is your example that difficult if someone set his or her mind to it? Consider for example that the software which does the unscrambling were to have its own internal list of annotation facilitating characters so that it assigned, for each page of the final rendered text, the characters in the list of annotation facilitating characters in order for each U+FFF9 U+FFFA pairing wherever the U+FFF9 item to be annotated were other than just one or more U+FFFC characters. The list of annotation facilitating characters could be something like U+002A, U+2020, U+2021, U+2051, that is, asterisk, dagger, double dagger, two asterisks aligned vertically. 
The annotation facilitating character is then placed both after the annotated item and before the annotation, wherever that may be on the page, such as in a footnote. I am not suggesting that an algorithm for such is quickly programmable, yet it does not seem on the face of it to be as unlikely to be possible as your comment might perhaps seem to imply. That is what markup and rich text formats are for. Well, maybe for your example, yet for my example a plain text file for the main text together with a .uof file to state
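For what it may be worth, the indentation exercise described above can be approximated in code. This is a rough sketch of my own, not anything from the original discussion: it walks a string containing U+FFF9 (annotation anchor), U+FFFA (separator), U+FFFB (terminator) and U+FFFC (object replacement) and emits each on its own line, indented by nesting depth, much as the Notepad exercise does by hand.

```python
# Sketch only: outline the structure of interlinear annotation characters.
ANCHOR, SEPARATOR, TERMINATOR, OBJECT = "\ufff9", "\ufffa", "\ufffb", "\ufffc"

def outline(text: str) -> str:
    lines, depth, buf = [], 0, ""
    def flush():
        # Emit any accumulated ordinary text at the current depth.
        nonlocal buf
        if buf:
            lines.append("  " * depth + buf)
            buf = ""
    for ch in text:
        if ch == ANCHOR:
            flush()
            lines.append("  " * depth + "U+FFF9")
            depth += 1
        elif ch == SEPARATOR:
            flush()
            # The separator aligns with its anchor, one level out.
            lines.append("  " * (depth - 1) + "U+FFFA")
        elif ch == TERMINATOR:
            flush()
            depth -= 1
            lines.append("  " * depth + "U+FFFB")
        elif ch == OBJECT:
            flush()
            lines.append("  " * depth + "U+FFFC")
        else:
            buf += ch
    flush()
    return "\n".join(lines)
```

Applied to Ken's nested example, this produces an indented outline very like the one worked out by hand above.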
Re: Discrepancy between Names List Code Charts?
John Hudson scripsit: The newish Gagauz Turkish Latin-script orthography derives from both Turkish and Romanian models. This has led to a peculiar hybrid, in which the cedilla is used for the s and the commaaccent is used for the t. ME's remarks in _The Alphabets of Europe_ seem downright bizarre to me: # Note that in # Romania, Gagauz uses the characters S WITH COMMA BELOW and T WITH COMMA BELOW. # In inferior Gagauz typography, the glyphs for these characters are sometimes # drawn with CEDILLAs, but it is strongly recommended to avoid this practice. # However, because Gagauz is a Turkic language, it may be left to the user to # decide whether S WITH COMMA BELOW (as in Romanian) or S WITH CEDILLA (as in # Turkish) is preferred. It seems that the last two sentences say that it may be left to the user to decide whether inferior or superior typography is preferred. -- De plichten van een docent zijn divers, John Cowan die van het gehoor ook. [EMAIL PROTECTED] --Edsger Dijkstra http://www.ccil.org/~cowan
Re: Furigana
Tex Texin scripsit: At the time (in the discussion), I don't think we had many examples of what the uses would be, and it wasn't clear that many were needed, since the functionality could be arrived at with higher level protocols. One application that has always seemed obvious to me is regular expressions: a compiled regular expression can be represented by a Unicode string, with non-characters representing things like any character, zero or more, one or more, beginning of string, end of string, etc. etc. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan One time I called in to the central system and started working on a big thick 'sed' and 'awk' heavy duty data bashing script. One of the geologists came by, looked over my shoulder and said 'Oh, that happens to me too. Try hanging up and phoning in again.' --Beverly Erlebacher
RE: OCR characters
I believe Eric is talking about the characters on the attached page 8 of the OCR standard. Regards Arnold -Original Message- From: Eric Muller [mailto:[EMAIL PROTECTED]] Sent: Thursday, August 15, 2002 7:44 PM To: [EMAIL PROTECTED] Subject: OCR characters In our OCR fonts, we have two glyphs named erase (looks like a black square) and grouperase (looks like a long dash). I don't have a copy of the OCR standards, but I suspect those are mandated by these standards. On the other hand, I can't find traces of those in Unicode, so I suspect they have been unified. But with which characters? More generally, are there other things like that we should be aware of? Thanks, Eric. Page-8-OCR-B.pdf Description: Binary data
Re: Discrepancy between Names List Code Charts?
On Fri, 16 Aug 2002, John Cowan wrote: John Hudson scripsit: The newish Gagauz Turkish Latin-script orthography derives from both Turkish and Romanian models. This has led to a peculiar hybrid, in which the cedilla is used for the s and the commaaccent is used for the t. ME's remarks in _The Alphabets of Europe_ seem downright bizarre to me: # Note that in # Romania, Gagauz uses the characters S WITH COMMA BELOW and T WITH COMMA BELOW. # In inferior Gagauz typography, the glyphs for these characters are sometimes # drawn with CEDILLAs, but it is strongly recommended to avoid this practice. # However, because Gagauz is a Turkic language, it may be left to the user to # decide whether S WITH COMMA BELOW (as in Romanian) or S WITH CEDILLA (as in # Turkish) is preferred. It seems that the last two sentences say that it may be left to the user to decide whether inferior or superior typography is preferred. -- De plichten van een docent zijn divers, John Cowan die van het gehoor ook. [EMAIL PROTECTED] --Edsger Dijkstra http://www.ccil.org/~cowan Friday, August 16, 2002 If fools such as I who know no Gagauz may rush in: It seems to me that reading is a learned habit. When different people learned to read Gagauz they may have learned to expect different forms of glyphs because that's what they were taught. Assuming teaching different conventions isn't based on an evil intent to pervert the minds of children, differing conventions are not bad, only different. It may be that such different conventions will gradually evolve to one, but I think Unicode would be wise to avoid attempting to impose standards on how written text appears and should instead aim to facilitate presentation of text legible to the conventions of current readers. We all live with two forms of lower case t (with and without the curved bottom) and lower case g (with and without the closed descender). 
It's possible these different conventions will disappear but until they do some will want one and some will want the other and I would hope Unicode could permit rendering software to provide either. Regards, Jim Agenbroad ( [EMAIL PROTECTED] ) It is not true that people stop pursuing their dreams because they grow old, they grow old because they stop pursuing their dreams. Adapted from a letter by Gabriel Garcia Marquez. The above are purely personal opinions, not necessarily the official views of any government or any agency of any. Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A. Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.
some cedillas
The Times Atlas of the World uses t-cedilla, d-cedilla, and h-cedilla in transcriptions of Yemen placenames. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: OCR characters
Eric Muller had written: In our OCR fonts, we have two glyphs named erase [...] and grouperase [...] I suspect those are mandated by these standards. On the other hand, I can't find traces of those in Unicode, Arnold F. Winkler wrote: I believe Eric is talking about the characters on the attached page 8 of the OCR standard. I don't have ISO 1073 at hand, only the German - DIN 66 008 (Jan 1978), which is essentially identical with ISO 1073/I-1976, and - DIN 66 009 (Sept. 1977), which is based on, but not identical with, ISO 1073/II-1976. DIN 66 008 contains the figure reported by Arnold Winkler. This standard does not specify the intended usage of these characters -- not beyond their expressive names. DIN 66 009 says about the equivalent OCR-B characters (my translation): In case of a typo, a keyboard-driven device will print the Character Erase on top of an erroneous character. This will cause the OCR reading device to ignore this position. The Group Erase may be either drawn by hand, or printed as discussed in the previous paragraph. It will cause the OCR reading device to ignore this position. So, these characters would never be read by an OCR device. They would be printed only in response to a function key (such as Erase Backwards), but never sent (encoded as characters) to a device. This means that they will not normally be encoded, hence there will probably be no need to assign Unicodes to them. The only exception could be a text discussing these characters and their usage. I think this sort of text would use figures rather than characters, to show the effect of overprinting in several variants. (The Erase, and the erased, character's positions may slightly differ.) So I guess these characters are deliberately left off Unicode. Best wishes, Otto Stolz
Re: some cedillas
Michael Everson scripsit: The Times Atlas of the World uses t-cedilla, d-cedilla, and h-cedilla in transcriptions of Yemen placenames. But is it correct? The National Geographic map on my wall uses s-cedilla in Romanian place names, and that's definitely wrong. -- Knowledge studies others / Wisdom is self-known; John Cowan Muscle masters brothers / Self-mastery is bone; [EMAIL PROTECTED] Content need never borrow / Ambition wanders blind; www.ccil.org/~cowan Vitality cleaves to the marrow / Leaving death behind.--Tao 33 (Bynner)
Re: some cedillas
At 10:58 -0400 2002-08-16, John Cowan wrote: Michael Everson scripsit: The Times Atlas of the World uses t-cedilla, d-cedilla, and h-cedilla in transcriptions of Yemen placenames. But is it correct? The National Geographic map on my wall uses s-cedilla in Romanian place names, and that's definitely wrong. The Times Atlas does use t-comma-below with Romanian placenames. Whether Times practice is correct for transliterating Arabic I couldn't say, but it's what they are doing. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
James Kass wrote as follows. William Overington wrote, No, it is a story about an artist who wanted to paint a picture of a horse and a picture of a dog and, since he knew that the horse and the dog were great friends and liked to be together and also that he only had one canvas upon which to paint, the artist painted a picture of a landscape with the horse and the dog in the foreground, thereby, as the saying goes, painting two birds on one canvas, http://www.users.globalnet.co.uk/~ngo/bird0001.htm in that he achieved two results by one activity. In addition the picture has various interesting details in the background, such as a windmill in a plain (or is that a windmill in a plain text file). :-) 1) It's gif file format rather than plain text.* 2) There isn't any windmill. The picture of the birds has been in our family webspace since 1998 as an illustration for the saying Painting two birds on one canvas. That saying, originated by me, is a peaceful saying meaning to achieve two results by one activity. I made the picture from clip art as a learning exercise. The picture of the birds is referenced as a way of illustrating the saying Painting two birds on one canvas. It is not the picture in the story about which Ken asked. I may well have a go at constructing such a picture, perhaps using clip art. The reference to a windmill is meant as a humorous allusion to Don Quixote tilting at windmills. I am interested in creative writing, so when Ken asked about the story, I just thought of something to put in my response. Part of the training in, and the fun of, creative writing is to be able to write something promptly to a topic. William Overington 16 August 2002
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
Tex Texin wrote as follows. William, So let me see if I understand this correctly. Let's take 2 perfectly good standards, Unicode and HTML, Yes. and make some very minor tweaks to them, No. such as changing the meaning of U+FFFC and a special format for filenames in the beginning of the file and a new extension, so we have something new. I have suggested no changes whatsoever to HTML at all. The only thing which I have suggested in relation to Unicode in this thread is that, in relation to the fact that information about the object to which any particular use of U+FFFC refers is kept outside the character data stream, that it could be a good idea to define a file format .uof so that details of the names of the files for which the U+FFFC codes are anchors could be provided in a known format, if and only if end users chose to use a .uof file for that purpose on that occasion and not otherwise. This was in the context of seeking to protect the use of U+FFFC as a character which could be used in interchanging of documents following from the discussion of U+FFFC and annotation characters in the thread from off of which I spun this thread, which discussion, by Ken and Doug, is repeated in the first posting of this present thread. I thought it a good idea that the Unicode Technical Committee might like to make such a .uof file format an official Unicode document so as to offer one possible way to use U+FFFC codes. That is now a matter for discussion. If the Unicode Consortium wishes to do that, then fine. If the Unicode Consortium chooses not to do that, then I can write it up myself and publish it, which is not such a good solution, yet is adequate for my own needs and might be useful for some other people if they choose to use the same format for .uof files. Hopefully I have now managed to raise the issue of protecting the fact that the U+FFFC character can be used in document interchange and it will hopefully not become deprecated to the status of a noncharacter. 
There is a practical reason for this, which is, from my own perspective, quite important. It is as follows. The DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) system (details at http://www.mhp.org ) implements my telesoftware invention. A Java program which has been broadcast can read a Unicode plain text file and act upon the characters within it, and can read other file formats, such as .png files (Portable Network Graphics), and act upon the information in those files, so as to produce a display. So a collection of files, namely a .uof file in the format that I suggested, a Unicode plain text file with one or more U+FFFC characters in it and the appropriate graphics files in .png format, broadcast as a package of free-to-the-end-user distance education learning material from a direct broadcasting satellite or a terrestrial transmitter, could be a very useful facility as the way to carry text with illustrations. Using HTML and a browser is just not the way to proceed in that situation. HTML and a browser is a very useful technique for the web and indeed is an option for the DVB-MHP system, yet the basic software system is Java based. It is as if the television set is acting as a computer which has a slow read-only access disc drive in the sky from which it may gather information, including software. The system is interactive with no return information link to the central broadcasting computer, by means of the telesoftware invention. Overlays and virtual running, with programs bigger than the local storage being able to be run using chaining techniques, are possible. Please do not think of this as downloading, as no uplink request is made! Now the big benefit of this completely new thing, Well, it's only a way of sender and receiver being able to have information in a file with the suffix .uof about what objects are being anchored by U+FFFC codes in a Unicode plain text file which it accompanies. 
is that programs that do desktop publishing can use plain text files which are not quite plain text because they have some special formatting, Well, the plain text files are only Unicode plain text which might contain one or more U+FFFC characters and some of the other Unicode control characters such as CARRIAGE RETURN. but now they can publish them in better manner than before. Well, my thinking is that it would help to have a well known way to express the meaning of the anchors encoded by U+FFFC in a file rather than having only a vague specification that all other information about the object is kept outside the data stream. I am saying that, yes, all other information about the object is kept outside the data stream and, if, and only if, end users choose to use a .uof file in a standard format to convey that information for some particular use of a U+FFFC code, then that format could be considered for definition and publication by the Unicode Consortium. That does not seem unreasonable to me.
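As a concrete sketch of the proposal (my own illustrative code; the file names and the helper function are hypothetical, not part of any published format), a receiving program could pair each U+FFFC anchor in the text file, in order, with the corresponding line of the accompanying .uof file:

```python
# Sketch of the proposed ".uof" convention: line 1 names the plain-text
# file; each remaining line names the object file anchored by the
# corresponding U+FFFC in that text, in order of occurrence.

OBJECT_REPLACEMENT = "\ufffc"

def pair_objects(uof_text: str, story_text: str):
    lines = [ln for ln in uof_text.splitlines() if ln]
    text_name, object_names = lines[0], lines[1:]
    anchors = story_text.count(OBJECT_REPLACEMENT)
    if anchors != len(object_names):
        raise ValueError(f"{text_name}: {anchors} anchors but "
                         f"{len(object_names)} object files listed")
    # Return (anchor index, object file name) pairs.
    return list(zip(range(anchors), object_names))
```

For the story7.uof example given earlier in the thread, the three U+FFFC anchors in story7.txt would pair with horse.gif, dog.gif and painting.jpg respectively.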
Re: Double Macrons on gh (was Re: Tildes on Vowels)
Peter_Constable at sil dot org wrote: Broad ranges of Planes 0 and 1 have been tentatively blocked out on the Roadmap for RTL scripts. Oh? I was somewhat sharply rebuked a few years ago for suggesting that such a thing be done. References to relevant documentation, please? It looks like the dog ate my homework. The Roadmap pages I was referring to: http://www.unicode.org/roadmaps/bmp-3-7.html (for Plane 0) http://www.unicode.org/roadmaps/smp-3-3.html (for Plane 1) no longer contain the gray-shaded areas indicating where RTL scripts are, or were, *TENTATIVELY* blocked out. The BMP page does still contain a note explaining this convention: Areas containing RTL scripts, as well as the Surrogates Zone and the Private Use Zone are shaded grey here informatively. but they aren't any more. Also, the links to PDF versions are broken, so I can't tell whether the PDF files still contain gray blocks or not. I would guess a claim that we could absolutely, positively guarantee that characters in a particular range would always be RTL would earn a rebuke. I was just going by what the Roadmaps (used to) say, and that's why I referred specifically to the Roadmaps and used the word tentatively. Should've double-checked first, though. -Doug Ewell Fullerton, California
Re: Furigana
John, Why would you want them to be for internal use only? When you exchange regular expressions, wouldn't you want operators such as any character to be passed as well, and standardized so that there is agreement on the meaning of the expression? It is also not clear to me that it is desirable to encode operators of regular expressions as individual characters, because then you get into the slippery slope of encoding operators for every function that someone might want, and that is what started this thread, isn't it... (But a Unicode APL operator set would be nice. ;-) ) tex John Cowan wrote: Tex Texin scripsit: At the time (in the discussion), I don't think we had many examples of what the uses would be, and it wasn't clear that many were needed, since the functionality could be arrived at with higher level protocols. One application that has always seemed obvious to me is regular expressions: a compiled regular expression can be represented by a Unicode string, with non-characters representing things like any character, zero or more, one or more, beginning of string, end of string, etc. etc. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan One time I called in to the central system and started working on a big thick 'sed' and 'awk' heavy duty data bashing script. One of the geologists came by, looked over my shoulder and said 'Oh, that happens to me too. Try hanging up and phoning in again.' --Beverly Erlebacher -- - Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED] Xen Master http://www.i18nGuy.com XenCraft http://www.XenCraft.com Making e-Business Work Around the World -
Re: Mac OS X Keyboard Layouts (was Re: new version of Keyman)
Arsa Deborah Goldsmith [EMAIL PROTECTED]: There is lots of good news about keyboards in Mac OS X 10.2, none of Thank you for that rapid, if intriguing response, Deborah. which I'm allowed to discuss until August 24, unfortunately. If you have signed an Apple non-disclosure agreement, write me privately and I have (signed many an Apple non-disclosure agreement), the first of which over a decade ago, established EGT's symbiotic relationship with Apple and enabled my series of translations of generations of Apple Mac operating systems into Irish - not to mention several much-loved Claris products in the interim - here's a big 'hi' to any ex-Claris people reading this.:-) Please write to me privately, as one always bound by those agreements, Deborah. If you could answer, as well, another question of great importance to my local community, I'd appreciate that, Deborah - the question is (given that EGT fostered/financed the development and distributed free-of-charge via its own site for so many years the keyboards made in-house here to serve many small linguistic communities), will Apple's new keyboards (including those for the 'Celtic' languages) be free of charge to users (that is, will EGT's policy of not charging end-users a penny for their use be continued)? I hope it will, mg I'll blab about all of it. :-) I will be discussing all this and more at the San Jose Unicode conference, which, thankfully, is after August 24. I will try to post something on August 24 giving the basics. Deborah Goldsmith Manager, Fonts Unicode Apple Computer, Inc. [EMAIL PROTECTED] -- Marion Gunn * EGT (Estab.1991) * http://www.egt.ie * fiosruithe/enquiries: [EMAIL PROTECTED] * [EMAIL PROTECTED] *
Re: some cedillas
At 06:57 AM 16-08-02, Michael Everson wrote: The Times Atlas of the World uses t-cedilla, d-cedilla, and h-cedilla in transcriptions of Yemen placenames. I would expect those cedillas to be dots below the letters for standard Arabic transliteration. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] Language must belong to the Other -- to my linguistic community as a whole -- before it can belong to me, so that the self comes to its unique articulation in a medium which is always at some level indifferent to it. - Terry Eagleton
Re: Furigana
Tex Texin scripsit: Why would you want them to be for internal use only? When you exchange regular expressions wouldn't you want operators such as any character to be passed as well, and standardized so that there is agreement on the meaning of the expression? Regular expressions are usually interchanged using (some approximation of) Posix syntax, so, as abc.*\*, not abcANYSTAR*. Note the phrase compiled form in my posting. It is also not clear to me that it is desirable to encode operators of regular expressions as individual characters, because then you get into the slippery slope of encoding operators for every function that someone might want, and that is what started this thread isn't it... Ah, but for internal use you can do what you want with the 66 non-characters and the 4 pseudo-non-characters. (But a Unicode APL operator set would be nice. ;-) ) Um, we have one of those, don't we? -- John Cowan [EMAIL PROTECTED] I am a member of a civilization. --David Brin
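To make the idea concrete, here is a toy of my own devising, not John Cowan's actual compiled form: reserve two noncharacters as internal opcodes, say U+FDD0 for "any character" and U+FDD1 for "zero or more of the preceding atom", and match directly against the "compiled" pattern string. As the thread stresses, such strings are for internal use only and must never be interchanged.

```python
# Toy matcher for a "compiled" pattern string that uses noncharacter
# code points as internal-only regex opcodes.
ANY = "\ufdd0"   # opcode: matches any single character (like Posix ".")
STAR = "\ufdd1"  # opcode: zero or more of the preceding atom (like "*")

def matches(pat: str, s: str) -> bool:
    # Simple backtracking matcher over the compiled pattern.
    if not pat:
        return not s
    if len(pat) > 1 and pat[1] == STAR:
        i = 0
        while True:
            if matches(pat[2:], s[i:]):
                return True
            if i < len(s) and (pat[0] == ANY or s[i] == pat[0]):
                i += 1
            else:
                return False
    if s and (pat[0] == ANY or s[0] == pat[0]):
        return matches(pat[1:], s[1:])
    return False
```

Under this scheme the Posix source form a.*c would compile to "a" + ANY + STAR + "c".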
22nd Unicode Conference, Sep 2002, San Jose, CA -- Just 3 weeks to go!
*** Register now! Just 3 weeks to go Register now! Just 3 weeks to go *** Twenty-second International Unicode Conference (IUC22) Unicode and the Web: Evolution or Revolution? http://www.unicode.org/iuc/iuc22 September 9-13, 2002 San Jose, California *** Full program now live! Five days of 3 tracks! Check the Web site! *** NEWS Visit the Conference Web site ( http://www.unicode.org/iuc/iuc22 ) to check the Conference program and register. To help you choose Conference sessions, we've included abstracts of talks and speakers' biographies. Guest rooms at the DoubleTree Hotel San Jose still available at the conference rate. Early bird registration rate extended to 23 August. CONFERENCE SPONSORS Agfa Monotype Corporation Basis Technology Corporation Microsoft Corporation Netscape Communications Oracle Corporation Reuters Ltd. Sun Microsystems, Inc. World Wide Web Consortium (W3C) GLOBAL COMPUTING SHOWCASE Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. For details, visit the Conference Web site. CONFERENCE VENUE The Conference will take place at: DoubleTree Hotel San Jose 2050 Gateway Place San Jose, CA 95110 USA Tel: +1 408 453 4000 Fax: +1 408 437 2898 CONFERENCE MANAGEMENT Global Meeting Services Inc. 8949 Lombard Place, #416 San Diego, CA 92122, USA Tel: +1 858 638 0206 (voice) +1 858 638 0504 (fax) Email: [EMAIL PROTECTED] or: [EMAIL PROTECTED] THE UNICODE CONSORTIUM The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. 
In addition to cooperating with ISO on the future development of ISO/IEC 10646, the Consortium is responsible for providing character properties and algorithms for use in implementations. Today the membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups. For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail [EMAIL PROTECTED] * * * * * Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. - --- Visit our Internet site at http://www.reuters.com Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.
Re: Furigana
John Cowan wrote: Tex Texin scripsit: Why would you want them to be for internal use only? When you exchange regular expressions wouldn't you want operators such as any character to be passed as well, and standardized so that there is agreement on the meaning of the expression? Regular expressions are usually interchanged using (some approximation of) Posix syntax, so, as abc.*\*, not abcANYSTAR*. Note the phrase compiled form in my posting. Seems like a very minor optimization then. (I am not saying undesirable, just that it is a small benefit.) It is also not clear to me that it is desirable to encode operators of regular expressions as individual characters, because then you get into the slippery slope of encoding operators for every function that someone might want, and that is what started this thread isn't it... Ah, but for internal use you can do what you want with the 66 non-characters and the 4 pseudo-non-characters. Yes. Same thing is true for higher level protocols. (But a Unicode APL operator set would be nice. ;-) ) Um, we have one of those, don't we? Sorry, I was unclear. I meant this in the context of encoding a set of APL-like operators for working on Unicode text, to manipulate it in regular expressions, going way beyond the any character and 0 or more character operators. tex -- John Cowan [EMAIL PROTECTED] I am a member of a civilization. --David Brin -- - Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED] Xen Master http://www.i18nGuy.com XenCraft http://www.XenCraft.com Making e-Business Work Around the World -
Revised proposal for Missing character glyph
Proposed unknown and missing character representation. This would be an alternative to the method currently described in Section 5.3.

The missing or unknown character would be represented as a series of vertical hex digit pairs, one pair for each byte of the character. BMP characters would be represented with 4 hex digits, i.e. two pairs of hex digits. Plane 1-16 characters would be represented as 6 digits, i.e. 3 pairs of digits. Garbage data with non-zero bits 24-31 may require 8 digits, i.e. 4 pairs of digits.

This representation would be recognized by untrained people as unrenderable data or garbage. It would serve the same function as a missing-glyph character, except that it would be visibly different from normal glyphs, so readers would know that something was wrong and that the text did not just happen to contain funny characters. It would aid people in finding the problem, and for people with Unicode books the text would be decipherable. If the information was truly critical, they could have the text deciphered.

The missing-character glyphs would be best rendered as a series of glyphs by a font engine capable of glyph positioning. If that is not possible, they could also be rendered by displaying a fractional space, followed by a set of two to four hex-pair glyphs for each character, followed by another fractional space. This would require 256 glyphs for the vertical hex pairs plus a fractional-space glyph.

This proposal would provide a standardized approach that vendors could adopt to clarify missing-character rendering and reduce support costs. By including this in the standard we could provide a consistent, cross-vendor solution.
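As a rough illustration of the digit-pair scheme described above, here is how a renderer might compute the pairs to stack for an unrenderable code point (the function name and formatting choices are mine, not part of the proposal):

```python
def hex_digit_pairs(cp: int) -> list[str]:
    """Split a code point's hex form into the byte pairs the proposal
    would render vertically: 4 digits for the BMP, 6 for planes 1-16,
    8 for garbage values with non-zero bits 24-31."""
    if cp <= 0xFFFF:
        digits = f"{cp:04X}"
    elif cp <= 0xFFFFFF:
        digits = f"{cp:06X}"
    else:
        digits = f"{cp:08X}"
    return [digits[i:i + 2] for i in range(0, len(digits), 2)]

print(hex_digit_pairs(0xFFFD))   # ['FF', 'FD']       (BMP: two pairs)
print(hex_digit_pairs(0x1F600))  # ['01', 'F6', '00'] (plane 1: three pairs)
```

Each pair would then be drawn with one of the 256 stacked-digit glyphs, bracketed by fractional spaces as the proposal suggests.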
RE: OCR characters
Otto, I am looking at ISO 1073/II-1976: The two erase characters are the only members of set #5; reference numbers are 120 and 121. The Remarks column is empty. Section 6.4 says: "Application advice is given in the column Remarks, where it is indicated, inter alia, which characters are included for general purpose use only and should not be used for OCR purposes." (I guess an empty column means that the character can be used for OCR.) I have not found any more information in ISO 1073/II:1976. Sorry, Arnold

-Original Message-
From: Otto Stolz [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 16, 2002 10:30 AM
To: Winkler, Arnold F
Cc: Eric Muller; [EMAIL PROTECTED]
Subject: Re: OCR characters

Eric Muller had written: "In our OCR fonts, we have two glyphs named erase [...] and grouperase [...] I suspect those are mandated by these standards. On the other hand, I can't find traces of those in Unicode." Arnold F. Winkler wrote: "I believe Eric is talking about the characters on the attached page 8 of the OCR standard."

I don't have ISO 1073 at hand, only the German
- DIN 66 008 (Jan. 1978), which is essentially identical with ISO 1073/I-1976, and
- DIN 66 009 (Sept. 1977), which is based on, but not identical with, ISO 1073/II-1976.

DIN 66 008 contains the figure reported by Arnold Winkler. This standard does not specify the intended usage of these characters, beyond their expressive names. DIN 66 009 says about the equivalent OCR-B characters (my translation): In case of a typo, a keyboard-driven device will print the Character Erase on top of an erroneous character. This will cause the OCR reading device to ignore this position. The Group Erase may be either drawn by hand, or printed as discussed in the previous paragraph. It will cause the OCR reading device to ignore this position. So, these characters would never be read by an OCR device.
They would be printed only in response to a function key (such as Erase Backwards), but never sent (encoded as characters) to a device. This means that they will not normally be encoded, hence there will probably be no need to assign Unicode code points to them. The only exception could be a text discussing these characters and their usage. I think this sort of text would use figures rather than characters, to show the effect of overprinting in several variants. (The Erase, and the erased, character's positions may differ slightly.) So I guess these characters were deliberately left out of Unicode.

Best wishes, Otto Stolz
RE: OCR characters
Folks, that is my VERY LAST post on this VERY OLD subject:

In the L2 document register I found L2/98-397 http://www.unicode.org/L2/L2/98396.pdf which is a proposal for ISO/IEC TR 15907, a Type 3 TR for the revision of ISO 1073/II:1976. On page 18 is a note that says:

NOTE – The glyphs previously defined with reference numbers 120 (CHARACTER ERASE) and 121 (GROUP ERASE) have been deleted.

That's the end of my digging in older documents. And have a nice weekend too!

Arnold
Unicode.org downtime reminder
This is a reminder. The Unicode.ORG system (web services, ftp, and mail lists) will be taken off-line sometime today for maintenance and upgrades. We will keep the downtime as short as possible. You will receive another note when the system comes back up, but it may not be possible to warn you again before the system is taken off-line, due to scheduling with our service provider.

Regards,
-- Sarasvati
Re: The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)
On 08/15/2002 06:41:59 AM William Overington wrote:

"In essence, though not formally, U+FFF9..U+FFFC are non-characters as well, and the Unicode semantics just tells what programs *may* find them useful for. Unicode 4.0 editors: it might be a good idea to emphasize the close relationship of this small repertoire with the non-characters." That is not what the specification says.

William, John knows what he is talking about, and is exactly correct: in essence, though not formally, FFF9..FFFC are non-characters. No, the Standard doesn't say that; that's why he said "not formally". The use intended by the Standard is, however, exactly comparable to the non-characters at FDD0..FDEF. If they had been defined in the Standard as non-characters, the world would not be different in any meaningful way.

It appears to me that the use of the annotation characters in document interchange is never forbidden and is strongly discouraged only where there is no prior agreement between the sender and the receiver, and that that strong discouragement is because the content may be misinterpreted otherwise. So, if there is a prior agreement, then there is no problem about using them in interchanged documents. There appears to be nothing that suggests that U+FFFC cannot be used in an interchanged document.

Well, you've missed the intent of the authors of the Standard, and appear not to grasp the mindset. When it says interchange of IA characters may be OK given prior agreement, what's really in mind is that, e.g., I've written code library A that handles some aspects of interlinear annotation, you've written code library B that handles different aspects of interlinear annotation, and we agree on certain interfaces so that my library can call yours or vice versa, and agree that strings passed through those interfaces can contain IA characters. That's the kind of thing that's in mind. It does *not* imply that anyone should consider creating a document containing IA characters.
I know little about Bliss symbols, though I have seen a few of them and have read a brief introduction to them, yet it seems to me that annotating Bliss symbols with English or Swedish is entirely within the specification, and would be no more than strongly discouraged even if there is no prior agreement between the sender and the receiver.

Of course the Standard doesn't discourage anyone from annotating Bliss symbols with English or Swedish; it only discourages the use of IA characters as markup in documents.

Further, it seems to me from the published rules that these annotation characters could possibly be used to provide a footnote annotation facility within a plain text file.

That would not be a proposal worth pursuing; in fact, I'd say it's a very bad idea. The reason you DO NOT want to use IA characters in a document is that you do not know what someone's software will do with them. The characters have always been intended for use by software programmers, not by content authors. (Ditto for the object replacement character.)

An interesting point for consideration is whether the following sequence is permitted in interchanged documents... It seems to me that if that is indeed permissible, it could potentially be a useful facility.

On the whole, it would be very unwise to use these characters in documents, for the reasons I explained above. If two people agree to do this, nobody's going to send the Unicode police to stop them. But very few of us on this list are particularly interested in what is hypothetically possible for some pair of us to do. We're far more interested in how widely-used implementations should and do work, and in such implementations, FFF9..FFFC are assumed not to be used in content.

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
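For reference, the sequence being debated is U+FFF9 (anchor), base text, U+FFFA (separator), annotation, U+FFFB (terminator). A hedged sketch of the kind of library-level handling Peter describes, here a fallback that drops annotation runs before handing text to software that does not understand them (the function name is illustrative, and this handles only the single-annotation case, not multiple U+FFFA separators):

```python
import re

# U+FFF9 base U+FFFA annotation U+FFFB -> keep only the base text.
_IA_RUN = re.compile("\uFFF9([^\uFFFA\uFFFB]*)\uFFFA[^\uFFFB]*\uFFFB")

def drop_annotations(text: str) -> str:
    """Replace each interlinear annotation run with its base text."""
    return _IA_RUN.sub(r"\1", text)

annotated = "\uFFF9\u6F22\uFFFAkan\uFFFB\uFFF9\u5B57\uFFFAji\uFFFB"
print(drop_annotations(annotated))  # 漢字
```

This is exactly why the characters are risky in interchanged documents: software without such handling may show the anchor, separator, and annotation text inline, or worse.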