Re: Biblical Hebrew
John Hudson wrote:

> At 03:52 PM 6/26/2003, Rick McGowan wrote:
>
> > I'll weigh in to agree with Ken here. The solution of cloning a whole set of these things just to fix combining behavior is, to understate, not quite nice.
>
> No, but it would be far from the not nicest thing in Unicode, and there's a really good reason for it. I was originally intrigued by Ken's ZWJ idea -- or by a variant of it using some new re-ordering inhibiting character, to avoid overloading ZWJ any further -- but the more I think about it, the more not nice I think it is to force Biblical scholars to carry the can for errors in the Unicode combining classes.

One of the reasons I keep poking around for alternatives that might work in a different way is that cloning sets of characters like this has a way of just displacing the problem. You don't want to force Biblical scholars to "carry the can" for the errors in the current combining classes... But who then ends up carrying the can, if we go the cloning route?

Cloning 14 characters creates a *new* normalization problem, and forces non-Biblical-scholar users of pointed Hebrew text to carry *that* particular can. How does a user of pointed Hebrew text know whether they are dealing with the legacy points, which people will have gone on using outside the circle of cognoscenti who switch their applications and fonts over to the corrected set of points? What happens if they edit text represented in one scheme with a tool meant for the other? What about searches on data with pointed Hebrew -- should they fold the two sets of points together or not? (And here I am talking about normalization by an ad hoc, custom folding, rather than generic Unicode normalization.) Who carries the can for writing the conversion routines between data in the two schemes? How about conversion from legacy character sets for bibliographic data -- does that need to be upgraded?
How about database implementations -- do they need custom extensions to do this folding as part of their query optimizations? And if the problem with the existing set of points is that their use in a normalized context eliminates distinctions that should be maintained, how do I write any conversion routines in such a way as to not corrupt or otherwise contaminate data using the new scheme? Who do I blame if my Hebrew fonts work with one set of points but not the other, and I'm getting intermittently trashed display as a result? ... and so on ...

I think if you really sit down and think about this in the larger context of users of Unicode Hebrew generically, instead of merely the Biblical Hebrew community that you are trying to find a solution for, you may realize that displacing the pain to *other* users may not be the best solution, either.

While the solution I am suggesting is not without its conversion problems, I think they are significantly more tractable than those posed by cloning code points. The folding issue is much more straightforward, since it would consist entirely of ignoring the CGJ and applying standard normalization (or not). The new scheme would be essentially transparent to systems that don't bother inserting CGJ between points, as long as their fonts could handle the combinations. Loss of ordering distinctions for data which is exported from the new systems and then reimported would be much less of an issue, since normalization could not destroy the distinctions without further intervention.

> I believe the aim in fixing this problem in Unicode should be to provide Biblical scholars with a good text processing experience, not with awkward kludges,

Yes, but I believe that is the responsibility of the systems and applications designers, given the tools and constraints we have to hand.

> even if that means making the Unicode Hebrew block look weird with duplicated marks.
I really believe there be dragons there, and the end result will be to make it *more* difficult for the systems and applications designers to provide a "good text processing experience" to all users of pointed Hebrew text. --Ken
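The custom folding Ken sketches above -- ignore the CGJ, then apply standard normalization -- takes only a few lines to express. The sketch below uses Python's standard `unicodedata` module; the function name is purely illustrative, not from any real library:

```python
# A minimal sketch of the ad hoc search/compare folding Ken describes:
# drop U+034F CGJ, then apply ordinary Unicode normalization.
import unicodedata

CGJ = "\u034F"  # U+034F COMBINING GRAPHEME JOINER

def fold_for_search(text: str) -> str:
    """Fold CGJ-marked text so it compares equal to legacy-pointed text."""
    return unicodedata.normalize("NFD", text.replace(CGJ, ""))

# Lamed + patah + CGJ + hiriq folds to the same string as the legacy
# (normalized) representation, where patah (ccc=17) sorts after hiriq (ccc=14):
assert fold_for_search("\u05DC\u05B7\u034F\u05B4") == \
       unicodedata.normalize("NFD", "\u05DC\u05B7\u05B4")
```

This is why Ken calls the folding tractable: it is a one-character deletion plus the normalization every conformant system already implements.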
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
John,

> At 03:36 PM 6/26/2003, Kenneth Whistler wrote:
>
> > Why is making use of the existing behavior of existing characters a "groanable kludge", if it has the desired effect and makes the required distinctions in text? If there is not some rendering system or font lookup showstopper here, I'm inclined to think it's a rather elegant way out of the problem.
>
> I think assumptions about not breaking combining mark sequences may, in fact, be a showstopper. If <consonant, mark, mark> becomes <consonant, mark, control, mark>, it is reasonable to think that this will not only inhibit mark re-ordering but also mark combining and mark interaction. Unfortunately, this seems to be the case with every control character I have been able to test, using two different rendering engines (Uniscribe and InDesign ME -- although the latter already has some problems with double marks in Biblical Hebrew). Perhaps we should have a specific COMBINING MARK SEQUENCE CONTROL character?

Actually, in casting around for a solution to the problem of format controls creating defective combining character sequences, it finally occurred to me that:

    U+034F COMBINING GRAPHEME JOINER

has the requisite properties. It is non-visible, does not affect the display of neighboring characters (except incidentally, if processes choose to recognize sequences containing it and process them distinctly), *AND* it is a *combining mark*, not a format control. Hence, the sequence:

    <lamed, patah, CGJ, hiriq>   (combining classes 17, 0, 14)

is *not* a defective combining character sequence, by the definitions in the standard. The entire sequence of three combining marks would have to "apply" to the lamed, but the fact that CGJ has cc=0 prevents the patah from reordering around the hiriq under normalization.

Could this finally be the missing "killer app" for the CGJ?

> All that said, I disagree with Ken that this is anything like an elegant
> way out of the problem. Forcing awkward, textually illogical and easily forgettable control character usage onto *users* in order to solve a problem in the Unicode Standard is not elegant, and it is unlikely to do much for the reputation of the standard.

I don't understand this contention. There is no reason, in principle, why this has to be surfaced to end users of Biblical Hebrew, any more than the messy details of embedding and override controls have to be surfaced to end users in order to make an interface which will support end-user control over direction in bidirectional text.

If CGJ is the one, then the only *real* implementation requirement would be that CGJ be consistently inserted (for Biblical Hebrew) between any pair of points applied to the same consonant. Depending on the particular application, this could either be hidden behind the input method/keyboard and be actively managed by the software, or it could be applied as a filter on an export format, when exporting to contexts that might neutralize intended contrasts or result in the wrong display by the application of normalization.

> Q: 'Why do I have to insert this control character between these points?'
> A: 'To prevent them from being re-ordered.'
> Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in the order I put them in?'
> A: 'Because Unicode normalisation will automatically re-order the points.'
> Q: 'But why? Points shouldn't be re-ordered: it breaks the text.'
> A: 'Yes, but the people who decided how normalisation should work for Hebrew didn't know that.'
> Q: 'Well can't they fix it?'
> A: 'They have: they've told you that you have to insert this control character...'

And that whole dialogue should be limited to the *programmers* only, whose job it is then to hide the details of how they get the magic to work from people who would find those details just confusing.

> Q: 'But *I* didn't make the mistake.
> Why should I have to be the one to mess around with this annoying control character?'
>
> ... and so on.
>
> Much as the duplication of Hebrew mark encoding may be distasteful, and even considering the work that will need to be done to update layout engines, fonts and documents to work with the new mark characters, I agree with Peter Constable that this is by far the best long term solution, especially from a *user* perspective.

I have to disagree. It should be largely irrelevant to the user perspective. In this case (as in others) the users are the experts about what their expected requirements are for text behavior, and in particular, what distinctions need to be maintained. But they should not be expected to define the technical means for fulfilling those requirements, nor lean over the shoulders of the engineers to tell them how to write the software to accomplish it.

> Over the past two months I have been over this problem in great detail with the Society of Biblical Literature and their partners in the SBL Font Foundation. They understand the problems with the current normalisation...
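Ken's central claim -- that CGJ, unlike a format control, sits inside the combining sequence and still blocks canonical reordering -- is easy to verify with any conformant normalizer. A quick check with Python's standard `unicodedata` module:

```python
# Verifying the CGJ behavior Ken describes, using Python's stdlib.
import unicodedata

lamed, patah, hiriq = "\u05DC", "\u05B7", "\u05B4"  # ccc 0, 17, 14
cgj = "\u034F"  # U+034F COMBINING GRAPHEME JOINER, ccc 0

# Without CGJ, canonical reordering swaps the points (ccc 17 > ccc 14):
assert unicodedata.normalize("NFC", lamed + patah + hiriq) == lamed + hiriq + patah

# With CGJ between the points, the sequence survives normalization intact:
marked = lamed + patah + cgj + hiriq
assert unicodedata.normalize("NFC", marked) == marked
```

Canonical reordering only permutes runs of marks with nonzero combining class; any cc=0 character splits the run, which is exactly the effect being exploited here.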
Re: Biblical Hebrew
1. I agree with Ken about the current lack of precedent for Cfs before combining marks. Interestingly, we do have a proposal to do just that, in http://www.unicode.org/review/pr-9.pdf. However, note that the whole purpose of putting the Cf after the Ra is to separate it from the halant, so that the halant will ligate with the following character rather than the preceding. So in that sense, PR#9 is entirely consistent with breaking a combining character sequence into two parts.

2. Because Cfs do break combining sequences, I would be very leery of using any of them to solve the Biblical Hebrew issue. One possibility is to use a combining mark instead. That is, something with (a) no visible glyph, (b) combining class = 0, and (c) general category = Mn. Unlike the Cfs, this would *not* break a combining sequence. There would be two possibilities:

a. define a new character with these characteristics.
b. use a variation selector character.

Now, we decided that VS characters would not apply to any but base characters, but one of the primary reasons for that was so that they wouldn't disturb canonical order. So easing this restriction in this case might be reasonable, since that is exactly the point! Of course, such a change would need to be sanctioned by the UTC, and it might take a while before fonts supported it, but it may be a way out, one that doesn't require waiting for the assignment of a new character. So this is in the spirit of Ken's original proposal.

Mark
__
http://www.macchiato.com ► “Eppur si muove” ◄

- Original Message -
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, June 26, 2003 17:48
Subject: Re: Biblical Hebrew

> Rick wrote:
>
> > > I now like better the suggestions of RLM or WJ for this.
> >
> > I'll have to disagree with Ken. I'm not so sure about either of these.
> > I don't think anyone has, in the past, considered what conforming or non-conforming behavior would be for an RLM or WJ between two combining marks. This needs a bunch more study to determine what on earth it would break in existing implementations.
>
> Point taken.
>
> > On the other hand, ZWJ between two combining marks has at least been discussed, and in the case of Indic anyway, it has known, documented effects.
>
> This, however, has the same problem. The specification of the use of ZWJ and ZWNJ in Indic scripts is not *between* two combining marks, but following a combining mark (halant), preceding another base character, usually a consonant. So we don't really know what the implications of trying to put it between two combining marks would be -- there aren't any specifications for doing so (yet).
>
> --Ken
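As it happens, the properties Mark lists -- (a) invisible, (b) combining class 0, (c) general category Mn -- are exactly those of U+034F COMBINING GRAPHEME JOINER, the existing character Ken identified elsewhere in this thread. Criteria (b) and (c) can be checked directly against the Unicode Character Database, e.g. with Python:

```python
# Checking Mark's criteria (b) and (c) against U+034F CGJ, using
# Python's unicodedata (which reflects the Unicode Character Database).
import unicodedata

cgj = "\u034F"
assert unicodedata.name(cgj) == "COMBINING GRAPHEME JOINER"
assert unicodedata.category(cgj) == "Mn"   # general category: nonspacing mark
assert unicodedata.combining(cgj) == 0     # canonical combining class 0
```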
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Michael wrote:

> At 15:36 -0700 2003-06-26, Kenneth Whistler wrote:
>
> > I now like better the suggestions of RLM or WJ for this.
>
> ZZZT. Thank you for playing.
>
> RLM is for forcing the right behaviour for stops and parentheses and question marks and so on. Introducing it between two combining characters in Hebrew text would break all kinds of things,

True, apparently, but not for the reasons you surmise. RLM does not "force behavior" on things. It is a strong right-to-left context that can change the resolved directionality of neutrals or weak types next to it. In between two characters that are already R, the presence or absence of an RLM is basically a no-op for bidi. Just considering the bidi algorithm, a sequence <R, NSM, RLM, NSM> would have the resolved directions <R, R, R, R>, effectively no different from the resolved directions <R, R, R> of the sequence <R, NSM, NSM> without the RLM. The problem arises when you go to consider the graphic application of the combining mark to its base form, and for that, the issue is apparently the same for the WJ, ZWJ, or any other format control in such a position. So this has nothing to do with the bidi function of RLM.

> and would be horrible, horrible, horrible. Invent a new control character for this weird property-killer, if you must, but don't use an ordering mark for it

If you invent a "new control character" for this "weird property-killer" (which it wouldn't be, since in any case, I'm just talking about inserting a (cc=0) character in between two other characters, not changing or killing any properties), you still end up with exactly the same problem of graphic application, because the presence of any format control creates a defective combining character sequence which applications (apparently) won't display.

--Ken
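Ken's resolved-direction argument can be illustrated with a toy version of rule W1 of the bidi algorithm (each NSM takes the bidi type of the character before it). This is a sketch of that single rule only, not a bidi implementation:

```python
# Toy sketch of UBA rule W1: each NSM resolves to the type of the
# preceding character (or sos at the start of the run). RLM has bidi
# type R, so inserting it between two NSMs on an R base changes nothing.
def resolve_w1(types, sos="R"):
    resolved, prev = [], sos
    for t in types:
        t = prev if t == "NSM" else t
        resolved.append(t)
        prev = t
    return resolved

# With RLM (type R) between the marks, and without it:
assert resolve_w1(["R", "NSM", "R", "NSM"]) == ["R", "R", "R", "R"]
assert resolve_w1(["R", "NSM", "NSM"]) == ["R", "R", "R"]
```

Either way, every position resolves to R -- which is Ken's point that the breakage seen in rendering tests cannot be coming from RLM's bidi semantics.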
Re: Biblical Hebrew
Rick wrote:

> > I now like better the suggestions of RLM or WJ for this.
>
> I'll have to disagree with Ken. I'm not so sure about either of these. I don't think anyone has, in the past, considered what conforming or non-conforming behavior would be for an RLM or WJ between two combining marks. This needs a bunch more study to determine what on earth it would break in existing implementations.

Point taken.

> On the other hand, ZWJ between two combining marks has at least been discussed, and in the case of Indic anyway, it has known, documented effects.

This, however, has the same problem. The specification of the use of ZWJ and ZWNJ in Indic scripts is not *between* two combining marks, but following a combining mark (halant), preceding another base character, usually a consonant. So we don't really know what the implications of trying to put it between two combining marks would be -- there aren't any specifications for doing so (yet).

--Ken
Re: Biblical Hebrew
At 03:52 PM 6/26/2003, Rick McGowan wrote:

> I'll weigh in to agree with Ken here. The solution of cloning a whole set of these things just to fix combining behavior is, to understate, not quite nice.

No, but it would be far from the not nicest thing in Unicode, and there's a really good reason for it. I was originally intrigued by Ken's ZWJ idea -- or by a variant of it using some new re-ordering inhibiting character, to avoid overloading ZWJ any further -- but the more I think about it, the more not nice I think it is to force Biblical scholars to carry the can for errors in the Unicode combining classes.

Control characters, usually ZWJ and ZWNJ, seem to get proposed as solutions to all sorts of text processing complexities. Some of these are perfectly legitimate and reflect the need of users to be able to control the display of text in different ways, e.g. by forcing half-forms in Indic scripts. But I don't think control characters should be used as fixes for mistakes, especially not when the distinction is not between two different but equally valid ways of displaying the same text, e.g. as a conjunct ligature or with half-forms, but between displaying text correctly or incorrectly. How many English users would accept a text processing model in which the distinction between 'goal' and 'gaol' relied on insertion of a control character between the vowels?

I believe the aim in fixing this problem in Unicode should be to provide Biblical scholars with a good text processing experience, not with awkward kludges, even if that means making the Unicode Hebrew block look weird with duplicated marks. The standard should serve the users, not the aesthetic and organisational sensitivities of the people who design the standard.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC  [EMAIL PROTECTED]

If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist.
But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 03:36 PM 6/26/2003, Kenneth Whistler wrote:

> Why is making use of the existing behavior of existing characters a "groanable kludge", if it has the desired effect and makes the required distinctions in text? If there is not some rendering system or font lookup showstopper here, I'm inclined to think it's a rather elegant way out of the problem.

I think assumptions about not breaking combining mark sequences may, in fact, be a showstopper. If <consonant, mark, mark> becomes <consonant, mark, control, mark>, it is reasonable to think that this will not only inhibit mark re-ordering but also mark combining and mark interaction. Unfortunately, this seems to be the case with every control character I have been able to test, using two different rendering engines (Uniscribe and InDesign ME -- although the latter already has some problems with double marks in Biblical Hebrew). Perhaps we should have a specific COMBINING MARK SEQUENCE CONTROL character?

All that said, I disagree with Ken that this is anything like an elegant way out of the problem. Forcing awkward, textually illogical and easily forgettable control character usage onto *users* in order to solve a problem in the Unicode Standard is not elegant, and it is unlikely to do much for the reputation of the standard.

Q: 'Why do I have to insert this control character between these points?'
A: 'To prevent them from being re-ordered.'
Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in the order I put them in?'
A: 'Because Unicode normalisation will automatically re-order the points.'
Q: 'But why? Points shouldn't be re-ordered: it breaks the text.'
A: 'Yes, but the people who decided how normalisation should work for Hebrew didn't know that.'
Q: 'Well can't they fix it?'
A: 'They have: they've told you that you have to insert this control character...'
Q: 'But *I* didn't make the mistake. Why should I have to be the one to mess around with this annoying control character?'

... and so on.
Much as the duplication of Hebrew mark encoding may be distasteful, and even considering the work that will need to be done to update layout engines, fonts and documents to work with the new mark characters, I agree with Peter Constable that this is by far the best long term solution, especially from a *user* perspective.

Over the past two months I have been over this problem in great detail with the Society of Biblical Literature and their partners in the SBL Font Foundation. They understand the problems with the current normalisation, and they understand that any solution is going to require document and font revisions; they're resigned to this, and they've worked hard to come up with combining class assignments that would actually work for all consonant + mark(s) sequences encountered in Biblical Hebrew. This work forms the basis of the proposal submitted by Peter Constable.

Encoding of new Biblical Hebrew mark characters provides a relatively simple update path for both documents and fonts, since it largely involves one-to-one mappings from old characters to new. Conversely, insisting on using control characters to manage mark ordering in texts will require analysis to identify those sequences that will be subject to re-ordering during normalisation, and individual insertion of control characters. The fact that these control characters are invisible and not obvious to users transcribing text puts an additional burden on application and font support, and adds another level of complexity to using what are already some of the most complicated fonts in existence (how many fonts do you know that come with 18 page user manuals?). I think it is unreasonable to expect Biblical scholars to understand Unicode canonical ordering to such a deep level that they are able to know where to insert control characters to prevent a re-ordering that shouldn't be happening in the first place.
John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC  [EMAIL PROTECTED]
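The one-to-one update path John describes could look something like the following sketch. The target code points here are Private Use Area placeholders -- no cloned Hebrew points were actually assigned -- so this illustrates only the *shape* of such a conversion, not a real mapping:

```python
# Hypothetical sketch of the one-to-one document update John describes.
# The "new" code points are PUA placeholders, NOT real assignments.
OLD_TO_NEW = {
    "\u05B7": "\uE000",  # HEBREW POINT PATAH -> hypothetical re-encoded clone
    "\u05B4": "\uE001",  # HEBREW POINT HIRIQ -> hypothetical re-encoded clone
    # ... one entry per re-encoded point ...
}

def update_document(text: str) -> str:
    """Remap legacy points to the (hypothetical) new characters."""
    return "".join(OLD_TO_NEW.get(ch, ch) for ch in text)
```

The simplicity of this character-for-character remapping, versus the contextual analysis needed to decide where control characters must be inserted, is the core of John's argument.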
Re: Biblical Hebrew
Ken wrote...

> I now like better the suggestions of RLM or WJ for this.

I'll have to disagree with Ken. I'm not so sure about either of these. I don't think anyone has, in the past, considered what conforming or non-conforming behavior would be for an RLM or WJ between two combining marks. This needs a bunch more study to determine what on earth it would break in existing implementations.

On the other hand, ZWJ between two combining marks has at least been discussed, and in the case of Indic anyway, it has known, documented effects.

> > At least with having distinct vowel characters for Biblical Hebrew, we'd come to a point we could forget about it, and wouldn't be wincing every time we considered it.
>
> Au contraire. We'll be wincing forever for this one. There's no way of getting around the fact that this is merely a cloning of the whole set of points in order to have candidates for a reassigned set of combining classes.

I'll weigh in to agree with Ken here. The solution of cloning a whole set of these things just to fix combining behavior is, to understate, not quite nice.

The *best* thing to do, in my personal opinion (and I know it'll get shot down, so don't bother telling me so), is to fix the combining classes of the Hebrew points. Since the combining classes can't be fixed, because we have the normalization-stability albatross firmly down our gullets and will forever be choking on that, the next best thing is to use a ZWJ. Problem solved. Just document it.

Rick
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 15:36 -0700 2003-06-26, Kenneth Whistler wrote:

> I now like better the suggestions of RLM or WJ for this.

ZZZT. Thank you for playing.

RLM is for forcing the right behaviour for stops and parentheses and question marks and so on. Introducing it between two combining characters in Hebrew text would break all kinds of things, and would be horrible, horrible, horrible. Invent a new control character for this weird property-killer, if you must, but don't use an ordering mark for it.

--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Yerushala(y)im - or Biblical Hebrew (was Major Defect in Combining Classes of Tibetan Vowels)
At 03:04 PM 6/26/2003, Kenneth Whistler wrote:

> How about RLM? This already belongs, naturally, in the context of the Hebrew text handling, which is going to have to handle bidi controls.

Ouch. RLM is not expected to fall between combining marks. Not only does this not render correctly, Uniscribe treats it as an illegal sequence and inserts a dotted circle before the second mark.

> Another possibility to consider is U+2060 WORD JOINER, the version of the zero width non-breaking space unfreighted with the BOM confusion of U+FEFF.

I can't test this at the moment, because none of the fonts I have support it.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC  [EMAIL PROTECTED]
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 02:45 PM 6/26/2003, Mark Davis wrote:

> Another consequence is that it separates the sequence into two combining sequences, not one. Don't know if this is a serious problem, especially since we are concerned with a limited domain with non-modern usage, but I wanted to mention it.

It is a serious problem if separate combining sequences means, as it seems to in all the current apps I have tested, that marks separated by one of these control characters cannot be correctly positioned relative to a preceding consonant. Insertion of any zero-width control character between two marks applied to the same Hebrew consonant results in a loss of interaction between the marks (i.e. the first mark is not repositioned to accommodate the second), and the second mark loses all positioning intelligence and falls between the consonant and the next one. My guess is that the layout engine (Uniscribe in this case) makes the reasonable assumption that the two combining sequences do not interact.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC  [EMAIL PROTECTED]
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Peter responded:

> Ken Whistler wrote on 06/25/2003 06:57:56 PM:
>
> > People could consider, for example, representation of the required sequence:
> >
> >     <lamed, qamats, hiriq>
> >
> > as:
> >
> >     <lamed, qamats, ZWJ, hiriq>
>
> So, we want to introduce yet *another* distinct semantic for ZWJ?

Actually, no, I don't. That was just the first candidate that came to mind.

> We've got one for Indic, another for Arabic, another for ligatures (similar to that for Arabic, but slightly different). Now another that is "don't affect any visual change, just be there to inhibit reordering under canonical ordering / normalization"?

As I pointed out in a separate response, just putting the ZWJ there would *already* interrupt the reordering of the sequence. There is nothing new about that. The problem is that you might not be able to count on it not effecting a visual change, because the generic meaning of ZWJ is now intended to be ligation requesting, which does have visual consequences.

I now like better the suggestions of RLM or WJ for this. Both of those format controls, by *definition*, should have no impact on visual display in this context, the RLM because it would be inserted between two NSMs that pick up strong R-to-L directionality from the consonant, and the WJ because it would be inserted at a position where there already is no word/line break opportunity. But either of them, by their current definition and properties, would break the sequences for canonical reordering. So they already have the semantics of the putative new control in question: no effect on visual display, while inhibiting the canonical reordering of the point sequence.

> > The presence of a ZWJ (cc=0) in the sequence would block the canonical reordering of the sequence to hiriq before qamets.
> > If that is the essence of the problem needing to be addressed, then this is a much simpler solution which would impact neither the stability of normalization nor require mass cloning of vowels in order to give them new combining classes.
>
> Yes, it would accomplish all that; and is a groanable kludge.

Why is making use of the existing behavior of existing characters a "groanable kludge", if it has the desired effect and makes the required distinctions in text? If there is not some rendering system or font lookup showstopper here, I'm inclined to think it's a rather elegant way out of the problem.

> At least with having distinct vowel characters for Biblical Hebrew, we'd come to a point we could forget about it, and wouldn't be wincing every time we considered it.

Au contraire. We'll be wincing forever for this one. There's no way of getting around the fact that this is merely a cloning of the whole set of points in order to have candidates for a reassigned set of combining classes.

You're stuck between a rock and a hard place on this one. The UTC cannot entertain merely fixing the existing combining class assignments, because it breaks the normalization stability guarantee. We've all come to acknowledge, and most to accept, that, even though it still elicits groans. But in the 10646 WG2 context, coming in with a duplicate set of Hebrew points is not going to make any sense, because, as someone (John Cowan?) has already pointed out, 10646 doesn't assign combining classes, and so trying to justify character cloning on the basis of distinct combining class assignments isn't going to make any sense there.

You can always come in with the proposal to encode BIBLICAL HEBREW POINT PATAH and say, even though the glyph is identical, see, the name is different, so the character is different. But this is a pretty thin disguise, and is vulnerable to simple questioning: What is it for? Well, to point Biblical Hebrew texts.
But what was U+05B7 HEBREW POINT PATAH for? Well, to point Biblical Hebrew texts (or any Hebrew text, for that matter...). Well, then, what is the difference? Uh, the combining classes for the two are different. What is a combining class? ... and so on.

I'm trying to find a way, using existing characters and a simple set of text representational conventions, to make the distinctions and preserve the order relations that you need for decent font lookup, without the whole enterprise washing up on either of those two rocks.

--Ken
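All three of the candidate controls discussed here -- ZWJ, RLM, WJ -- have combining class 0, so any of them interrupts canonical reordering; the differences Ken weighs are in their *other* semantics (ligation, bidi, line breaking), not in this blocking effect. A quick check with Python's `unicodedata`, using Ken's qamats/hiriq example:

```python
# ZWJ, RLM and WJ all have ccc=0, so each blocks canonical reordering
# of the Hebrew points on either side of it.
import unicodedata

ZWJ, RLM, WJ = "\u200D", "\u200F", "\u2060"
lamed, qamats, hiriq = "\u05DC", "\u05B8", "\u05B4"  # ccc 0, 18, 14

# Without an intervening cc=0 character, qamats reorders after hiriq:
assert unicodedata.normalize("NFD", lamed + qamats + hiriq) == lamed + hiriq + qamats

for ctrl in (ZWJ, RLM, WJ):
    assert unicodedata.combining(ctrl) == 0
    seq = lamed + qamats + ctrl + hiriq
    assert unicodedata.normalize("NFD", seq) == seq  # order preserved
```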
Re: Yerushala(y)im - or Biblical Hebrew (was Major Defect in Combining Classes of Tibetan Vowels)
Jony took the words right out of my mouth:

> How about RLM?
>
> Jony

This already belongs, naturally, in the context of the Hebrew text handling, which is going to have to handle bidi controls.

Another possibility to consider is U+2060 WORD JOINER, the version of the zero width non-breaking space unfreighted with the BOM confusion of U+FEFF. WJ is also (gc=Cf, cc=0), so would block canonical reordering of a sequence it was inserted into. Unlike ZWJ, it should have no potentially conflicting semantics regarding ligation or anything else for display. It is *defined* only as specifying no break opportunity at its position: "...inserting a word joiner between two characters has no effect on their ligating and cursive joining behavior. The word joiner should be ignored in contexts other than word or line breaking."

Well, as before, we already know that the position between two points applied to the same consonant is not a word or line break opportunity, so inserting a WJ there should have no effect. And by definition, it should also have no effect on any glyph ligation (or any other aspect of the display). But it *would* break up the sequence that gets canonically reordered for normalization, thus enabling a textual distinction to be preserved.

One might even want to suggest that if RichEdit or some other text control causes a display problem when WJ is inserted between two Hebrew points, that should be considered a bug in the implementation of the WORD JOINER for that text control. Of course, I'm not privy to the internals of such implementations and don't understand the font lookup issues in the kind of detail that John clearly does, but if WORD JOINER cannot be implemented as the standard says it should be, then we've got a more serious problem on our hands than just the Biblical Hebrew vocalization issue.

--Ken

> > At 04:26 AM 6/26/2003, Jony Rosenne wrote:
> >
> > > I don't think we need any new characters, ZERO WIDTH SPACE would do and it requires no new semantics.
> > > > ZERO WIDTH SPACE would screw up search and sort algorithms, I think, > > because it is not a control character per se and may not be > > ignored as desired. > > > > I've made some tests using Ken's ZWJ suggestion and, as > > feared, it messes > > with the glyph positioning lookups. The results varied > > slightly between MS > > RichText clients and InDesign ME, but both displayed marks > > incorrectly when > > ZWJ was inserted. I strongly suspect that this is not > > something that can > > easily be resolved in the glyph shaping model. > > > > John Hudson
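Ken's claim about WORD JOINER is mechanically checkable with any conformant normalization library: because WJ has combining class 0, canonical reordering cannot move a point across it. A minimal sketch using Python's unicodedata (the particular lamed-plus-points sequence is just an illustration, not the disputed Biblical text):

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, combining class 14
PATAH = "\u05B7"  # HEBREW POINT PATAH, combining class 17
WJ    = "\u2060"  # WORD JOINER, gc=Cf, combining class 0

# Without WJ, canonical reordering sorts the points by combining class,
# so the two input orders become indistinguishable after normalization.
assert unicodedata.normalize("NFD", LAMED + PATAH + HIRIQ) == \
       unicodedata.normalize("NFD", LAMED + HIRIQ + PATAH)

# With WJ between the points, reordering cannot cross the cc=0
# character, so the original order of the points survives.
assert unicodedata.normalize("NFD", LAMED + PATAH + WJ + HIRIQ) == \
       LAMED + PATAH + WJ + HIRIQ
```

Whether a given layout engine *displays* such a sequence correctly is exactly the separate question John's tests raise; the normalization behavior itself follows directly from the combining classes.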
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Another consequence is that it separates the sequence into two combining sequences, not one. Don't know if this is a serious problem, especially since we are concerned with a limited domain with non-modern usage, but I wanted to mention it. Mark __ http://www.macchiato.com ► “Eppur si muove” ◄ - Original Message - From: "Kenneth Whistler" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Thursday, June 26, 2003 13:41 Subject: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels) > Peter replied to Karljürgen: > > > Karljürgen Feuerherm wrote on 06/25/2003 08:31:41 PM: > > > > > I was going to suggest something very similar, a ZW-pseudo-consonant of > > some > > > kind, which would force each vowel to be associated with one consonant. > > > > An invisible *consonant* doesn't make sense because the problem involves > > more than just multiple written vowels on one consonant; > > I agree that we don't want to go inventing invisible consonants for > this. > > BTW, there's already an invisible vowel (in fact a pair of them) > that is unwanted by the stakeholders of the script it was > originally invented for: > > U+17B4 KHMER VOWEL INHERENT AQ > > This is also (cc=0), so would serve to block canonical reordering > if placed between two Hebrew vowel points. But I'm sure that if > Peter thought the suggestion of the ZWJ for this was a "groanable > kludge", Biblical Hebraicists would probably not take lightly > to the importation of an invisible Khmer character into their > text representations. ;-) > > > in fact, that is > > a small portion of the general problem. If we want such a character, it > > would notionally be a zero-width-canonical-ordering-inhibiter, and nothing > > more. 
> > The fact is that any of the zero-width format controls has the > side-effect of inhibiting (or rather interrupting) canonical reordering > if inserted in the middle of a target sequence, because of their > own class (cc=0). > > I'm not particularly campaigning for ZWJ, by the way. ZWNJ or even > U+FEFF ZWNBSP would accomplish the same. I just suggested ZWJ because > it seemed in the ballpark. ZWNBSP would likely have fewer possible > other consequences, since notionally it means just "don't break here", > which you wouldn't do in the middle of a Hebrew combining character > sequence, anyway. > > > And I don't particular want to think about what happens when people start > > sticking this thing into sequences other than Biblical Hebrew ("in > > unicode, any sequence is legal"). > > But don't forget that these cc=0 zero width format controls already > can be stuck into sequences other than Biblical Hebrew. In some > instances they have defined semantics there (as for Arabic and > Indic scripts), but in all cases they would *already* have the > effect of interrupting canonical reordering of combining character > sequences if inserted there. > > --Ken > > > >
RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
That may be what you see. Myself, every time I look at it, I see an orphaned Hiriq without a consonant. It is normally placed in between the Lamed and the Mem, to make certain the point isn't missed (a pun). Jony > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of > [EMAIL PROTECTED] > Sent: Thursday, June 26, 2003 7:09 PM > To: [EMAIL PROTECTED] > Subject: RE: Major Defect in Combining Classes of Tibetan > Vowels (Hebrew) > > > Jony Rosenne wrote on 06/26/2003 06:26:02 AM: > > > It may look silly, but it is correct. What you see are letters > according to > > the writing tradition, which does not include a Yod, and vowels > according to > > the reading tradition which does. > > I understand that. My point was, you were talking about > phonology, but in > terms of the text, it was not correct: there *are* multiple > vowels on a > single consonant. > > > > There are in the Bible other, more extreme > > cases. > > I'd be interested in whatever info you can provide in that regard. > > > > > I don't think we need any new characters, ZERO WIDTH SPACE would do > > and > it > > requires no new semantics. > > No, that's a terrible solution: a space creates unwanted word > boundaries. > > > > Moreover, everybody who knows his Hebrew Bible > > knows the Yod is there although it isn't written. > > But the point is, how do people encode the text? The yod is > not there in > the text. How does a publisher encode text in the typesetting > process? How > do researchers encode the text they want to analyze? Saying, > "everybody > knows there's a yod there" doesn't provide a solution, > particularly given > that the researchers know in point of fact that the consonantal text > explicitly does not include a yod. > > > > > The Meteg is a completely different issue. There is a small > number of > places > > where the Meteg is placed differently. 
Since it does not behave the > > same > as > > the regular Meteg, and is thus visually distinguishable, it > should be > > possible to add a character, as long as it is clearly named. > > That is a potential solution, though it would have to be > *two* additional > metegs. > > > > - Peter > > > -- - > Peter Constable > > Non-Roman Script Initiative, SIL International > 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA > Tel: +1 972 708 7485 > > > >
Yerushala(y)im - or Biblical Hebrew (was Major Defect in Combining Classes of Tibetan Vowels)
How about RLM? Jony > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of John Hudson > Sent: Thursday, June 26, 2003 6:36 PM > To: Jony Rosenne > Cc: [EMAIL PROTECTED] > Subject: SPAM: RE: Major Defect in Combining Classes of > Tibetan Vowels (Hebrew) > > > At 04:26 AM 6/26/2003, Jony Rosenne wrote: > > >I don't think we need any new characters, ZERO WIDTH SPACE > would do and > >it requires no new semantics. > > ZERO WIDTH SPACE would screw up search and sort algorithms, I think, > because it is not a control character per se and may not be > ignored as desired. > > I've made some tests using Ken's ZWJ suggestion and, as > feared, it messes > with the glyph positioning lookups. The results varied > slightly between MS > RichText clients and InDesign ME, but both displayed marks > incorrectly when > ZWJ was inserted. I strongly suspect that this is not > something that can > easily be resolved in the glyph shaping model. > > John Hudson > > Tiro Typeworks www.tiro.com > Vancouver, BC [EMAIL PROTECTED] > > If you browse in the shelves that, in American bookstores, > are labeled New Age, you can find there even Saint Augustine, > who, as far as I know, was not a fascist. But combining Saint > Augustine and Stonehenge -- that is a symptom of Ur-Fascism. > > - Umberto Eco > > > >
Re: Revised N2586R
At 13:23 -0700 2003-06-26, Kenneth Whistler wrote: Not only is the name likely to change (based on all the issues already discussed), but it is conceivable that WG2 could decide to approve it at some other code position instead. Indeed I will probably propose to move the character on general principles. ;-) No cheating! ;-) It is even conceivable that WG2 could *refuse* to encode the character. (I shouldn't think so.) There have been precedents, where a UTC approved character met opposition in WG2, and the UTC later decided to rescind its approval in favor of maintaining synchronization of the standards when published. And vice versa. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Peter replied to Karljürgen: > Karljürgen Feuerherm wrote on 06/25/2003 08:31:41 PM: > > > I was going to suggest something very similar, a ZW-pseudo-consonant of > some > > kind, which would force each vowel to be associated with one consonant. > > An invisible *consonant* doesn't make sense because the problem involves > more than just multiple written vowels on one consonant; I agree that we don't want to go inventing invisible consonants for this. BTW, there's already an invisible vowel (in fact a pair of them) that is unwanted by the stakeholders of the script it was originally invented for: U+17B4 KHMER VOWEL INHERENT AQ This is also (cc=0), so would serve to block canonical reordering if placed between two Hebrew vowel points. But I'm sure that if Peter thought the suggestion of the ZWJ for this was a "groanable kludge", Biblical Hebraicists would probably not take kindly to the importation of an invisible Khmer character into their text representations. ;-) > in fact, that is > a small portion of the general problem. If we want such a character, it > would notionally be a zero-width-canonical-ordering-inhibiter, and nothing > more. The fact is that any of the zero-width format controls has the side-effect of inhibiting (or rather interrupting) canonical reordering if inserted in the middle of a target sequence, because of their own class (cc=0). I'm not particularly campaigning for ZWJ, by the way. ZWNJ or even U+FEFF ZWNBSP would accomplish the same. I just suggested ZWJ because it seemed in the ballpark. ZWNBSP would likely have fewer possible other consequences, since notionally it means just "don't break here", which you wouldn't do in the middle of a Hebrew combining character sequence, anyway. > And I don't particularly want to think about what happens when people start > sticking this thing into sequences other than Biblical Hebrew ("in > unicode, any sequence is legal"). 
But don't forget that these cc=0 zero width format controls already can be stuck into sequences other than Biblical Hebrew. In some instances they have defined semantics there (as for Arabic and Indic scripts), but in all cases they would *already* have the effect of interrupting canonical reordering of combining character sequences if inserted there. --Ken
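Ken's general point, that any cc=0 character interrupts canonical reordering whatever its other semantics, can be demonstrated directly for each of the candidates he names (ZWJ, ZWNJ, and U+FEFF ZWNBSP). A sketch; the specific Hebrew points chosen here are arbitrary:

```python
import unicodedata

LAMED  = "\u05DC"  # HEBREW LETTER LAMED
HIRIQ  = "\u05B4"  # HEBREW POINT HIRIQ,  combining class 14
QAMATS = "\u05B8"  # HEBREW POINT QAMATS, combining class 18

# Without an intervening character, normalization reorders the points
# by combining class (14 < 18), destroying the encoded order.
assert unicodedata.normalize("NFC", LAMED + QAMATS + HIRIQ) == \
       LAMED + HIRIQ + QAMATS

# Each zero-width format control mentioned has combining class 0,
# so inserting any of them interrupts the reordering.
for blocker in ("\u200D",    # ZERO WIDTH JOINER
                "\u200C",    # ZERO WIDTH NON-JOINER
                "\uFEFF"):   # ZERO WIDTH NO-BREAK SPACE
    assert unicodedata.combining(blocker) == 0
    seq = LAMED + QAMATS + blocker + HIRIQ
    assert unicodedata.normalize("NFC", seq) == seq
```

Note that, as Mark observes elsewhere in this thread, the cc=0 character also splits the run into two combining sequences; the demonstration above only shows the normalization behavior, not the display or segmentation consequences.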
Re: Nightmares
At 14:32 -0400 2003-06-26, John Cowan wrote: If you are going to discriminate (invidiously) using a computerized database, using H for Handicapped (or G for Gimp) will do just as well. Are you going to complain about the various symbols of religion already encoded on the same grounds? I am preparing additional religious symbols to help fill the gaps. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 10:09 AM 6/26/2003, [EMAIL PROTECTED] wrote: > The Meteg is a completely different issue. There is a small number of places > where the Meteg is placed differently. Since it does not behave the same as > the regular Meteg, and is thus visually distinguishable, it should be > possible to add a character, as long as it is clearly named. That is a potential solution, though it would have to be *two* additional metegs. Can you explain your thinking here, Peter? I agree that if the intention is to encode new Biblical Hebrew marks with revised combining classes, then two new metegs would be necessary if we want one left and one right. But if one were to accept the text encoding hack of a ZERO-WIDTH CANONICAL ORDERING INHIBITOR -- which seems less and less like a good idea, and more and more like a long-term embarrassment and, like ZWJ and ZWNJ, a pain in the neck for users who have every right to expect a sensible encoding that doesn't require such gymnastics --, then I think one would only need a new HEBREW POINT RIGHT METEG character, and let it be assumed that the existing meteg character is the left position form (its current combining class puts it after all vowels, I believe). John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Revised N2586R
Doug, Peter, and Michael already provided good responses to this suggestion by William O, but here is a little further clarification. > Well, certainly authority would be needed, yet I am suggesting that where a > few characters added into an established block are accepted, which is what > is claimed for these characters, there should be a faster route than having > to wait for bulk release in Unicode 4.1. If these characters have been > accepted, why not formally warrant their use now by having Unicode 4.001 > and then having Unicode 4.002 when a few more are accepted? Approvals aren't *finished* until both the UTC and ISO JTC1/SC2/WG2 have completed their work. The JTC1 balloting and approval process is a lengthy and deliberate one, and there are many precedents where a proposed character, perhaps one already approved by the UTC, has been moved in a subsequent balloting in response to a national body comment. Only when both committees have completed all approvals and have verified they are finally in synch with each other, do they proceed with formal publication of the *standardized* encodings for the new characters. The reasons the UTC "approves" characters and posts them in the Pipeline page at www.unicode.org in advance of the actual final standardization are: A. To avoid the chicken and the egg problem for the two committees. Someone has to go first on an approval, since the committees do not meet jointly. Sometimes the UTC goes first, and sometimes WG2 goes first. B. To give notice to people regarding what is in process and what stage of approval it is at. This helps in precluding duplicate submissions and also helps in assigning code points for new characters when we are dealing with large numbers of new submissions. > These minor > additions to the Standard could be produced as characters are accepted and > publicised in the Unicode Consortium's webspace. The UTC can and does give notification regarding what characters have reached "approved" status. 
The Pipeline page at www.unicode.org is, for example, about to be updated with the 215 new character approvals from the recent UTC meeting. > If the characters have not > been accepted then they cannot be considered ready to be used, yet if they > have been accepted, what is the problem in releasing them so that people who > want to get on with using them can do so? See above. Standardization bodies must move deliberately and carefully, since if they publish mistakes, everybody is saddled with them essentially forever. In the case of encoding large numbers of additional characters, because the UTC has plenty of experience at the kind of shuffling around that may occur while balloting is still under consideration, it would be irresponsible to publish small revisions and encourage people to start using characters that we know have not yet completed all steps of the standardization process. > Why is it that it is regarded by the Unicode Consortium > as reasonable that it takes years to get a character through the committees > and into use? Because with the experience of four major revisions of the Unicode Standard (and numerous minor revisions) and the experience of three major revisions of ISO/IEC 10646 (and numerous individual amendments) under our belt, we know that is how long it takes in actual practice. > The idea of having to use the > Private Use Area for a period after the characters have been accepted is > just a nonsense. Please take a look at: http://www.unicode.org/alloc/Caution.html which has long been posted to help explain why character approval is not just an instantaneous process. The further along a particular character happens to be in the ISO JTC1 approval process, the less likely it is that it will actually move before the standard is actually published. 
Implementers can, of course, choose whatever level of risk they can handle when doing early implementation of provisionally approved characters which have not yet been formally published in the standards. But if they guess wrong and implement a character (in a font or in anything else) that is moved at some point in the balloting, then that was just the risk they took, and they can't expect to come back to the committees bearing complaints and grievances about it. If you, for example, want to put U+267F HANDICAPPED SIGN in a font now, nobody will stop you, but bear in mind that this character is only at Stage 1 of the ISO process -- it has not yet been considered or even provisionally approved by WG2. Not only is the name likely to change (based on all the issues already discussed), but it is conceivable that WG2 could decide to approve it at some other code position instead. It is even conceivable that WG2 could *refuse* to encode the character. There have been precedents, where a UTC approved character met opposition in WG2, and the UTC later decided to rescind its approval in favor of maintaining synchronization of the standards when published.
Re: Question about Unicode Ranges in TrueType fonts
On Thursday, June 26, 2003 8:16 PM, Elisha Berns <[EMAIL PROTECTED]> wrote: > It would appear from your answer that even after implementing the > algorithm to search the Unicode block coverage of a font, the actual > comparison "data", that is which blocks to compare and how many code > points, is totally undefined. Is there any kind of standard for > defining what codepoints are required to write a given language? This > seems like the issue that fontconfig gets around by using all those > .orth files which define the codepoints for a given language. But is > there any standardized set of language required codepoint definitions > that could be used? > > Anyways, where is the up-to-date list of Unicode blocks to be found? On the Unicode.org website or its published book. > It's odd to think that the old way of using Charset identifiers in > fonts worked a lot more cleanly for finding fonts matching a > language/language group. I would think this kind of core issue would > be addressed more cleanly by the font standard. The ICU datafiles contain such lists of codes needed to cover almost completely each combination of language+script. Now these datafiles are shared across multiple implementations with the I18n initiative project, which tries to define a common source of locale data for multiple vendors (previously this project was in li18nux.org, now extended to cover other open systems than Linux, such as most BSD and Unix variants, with a joint effort with the GNU project and other Unix and Java solution providers)... Of course, nothing forbids a particular text from using other characters than those strictly needed for a particular language...
RE: Question about Unicode Ranges in TrueType fonts
Elisha Berns asked: > It would appear from your answer that even after implementing the > algorithm to search the Unicode block coverage of a font, the actual > comparison "data", that is which blocks to compare and how many code > points, is totally undefined. Is there any kind of standard for > defining what codepoints are required to write a given language? This > seems like the issue that fontconfig gets around by using all those > .orth files which define the codepoints for a given language. But is > there any standardized set of language required codepoint definitions > that could be used? Not a standard that I know of, but there are a number of compilations of what *characters* are required for the alphabets of various languages. See, for example: http://www.evertype.com/alphabets/index.html for European languages. From each list of characters it is fairly straightforward to derive what Unicode encoded characters would be required to support that list. http://www.eki.ee/itstandard/ladina/ is another source. This goes a little further afield into languages using Cyrillic characters, and also provides information about Unicode encodings directly. Note that for any such listing, you still need to take into account what punctuation or other characters might also be needed for the language's conventional orthography/ies, since the typical listing you will find is only for the alphabetic characters used by the language. > > Anyways, where is the up-to-date list of Unicode blocks to be found? http://www.unicode.org/Public/UNIDATA/Blocks.txt > > It's odd to think that the old way of using Charset identifiers in fonts > worked a lot more cleanly for finding fonts matching a language/language > group. I would think this kind of core issue would be addressed more > cleanly by the font standard. Which font standard? And this is an area where implementation strategies still seem to be in ferment. 
At some point this may settle down and be the subject of standardization, but premature standardization can also be a problem if the wrong choices get codified too soon. --Ken
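The Blocks.txt file Ken points to has a simple, stable line format (`start..end; Block Name` plus `#` comments), so deriving block data for the coverage algorithm discussed in this thread is only a few lines of parsing. A sketch, with a small inline sample standing in for the real file fetched from unicode.org:

```python
# Sample lines in the format of http://www.unicode.org/Public/UNIDATA/Blocks.txt
# (inlined so the sketch is self-contained; a real tool would fetch the file).
BLOCKS_TXT = """\
# Blocks-4.0.0.txt
0000..007F; Basic Latin
0590..05FF; Hebrew
FB00..FB4F; Alphabetic Presentation Forms
"""

def parse_blocks(text):
    """Yield (start, end, name) tuples from Blocks.txt-style data."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        rng, name = line.split(";")
        start, end = rng.split("..")
        yield int(start, 16), int(end, 16), name.strip()

def block_of(cp, blocks):
    """Return the name of the block containing codepoint cp, or None."""
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return None

blocks = list(parse_blocks(BLOCKS_TXT))
print(block_of(0x05B4, blocks))  # Hebrew
```

Since the block list grows with each Unicode version, reparsing the current Blocks.txt (rather than hard-coding ranges) is what keeps the coverage data up to date, as Andrew's suggestion requires.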
Re: Question about Unicode Ranges in TrueType fonts
Elisha Berns scripsit: > It's odd to think that the old way of using Charset identifiers in fonts > worked a lot more cleanly for finding fonts matching a language/language > group. I would think this kind of core issue would be addressed more > cleanly by the font standard. Actually it worked by dumb luck (or market forces if you prefer). There was never any guarantee that, because a font was encoded as Latin-1, it contained glyphs for all the Latin-1 characters. -- All Gaul is divided into three parts: the part that cooks with lard and goose fat, the part that cooks with olive oil, and the part that cooks with butter. -- David Chessler John Cowan www.ccil.org/~cowan www.reutershealth.com [EMAIL PROTECTED]
Re: WHEELCHAIR (was Revised N2586R)
WHEELCHAIR SYMBOL at least has the virtue of being descriptive of the symbol rather than of the use and thus potentially more neutral all the way around. K - Original Message - From: "Michael Everson" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Thursday, June 26, 2003 2:13 PM Subject: Re: Revised N2586R > At 12:09 -0500 2003-06-26, [EMAIL PROTECTED] wrote: > > >The only meaning that the Standard implies is that the character encoded > >at codepoint x represents the symbol of a wheelchair. It does not imply > >*anything* about how its usage in juxtaposition with the name of a person > >should be interpreted. > > Indeed William's argument that "HANDICAPPED" is somehow inappropriate > just doesn't wash. In Europe at least, many handicapped people > consider it far more polite to be called handicapped or behindert or > what have you than to be subject to such politically "correct" > monstrosities as "differently abled". > > Which is not to say that the Name Police won't prefer WHEELCHAIR > SYMBOL. Time will tell. > -- > Michael Everson * * Everson Typography * * http://www.evertype.com > >
Re: Nightmares
William Overington scripsit: > This issue has arisen because of my concern that a particular symbol has > been labelled as HANDICAPPED SIGN. I hope that the name will be changed to > WHEELCHAIR SYMBOL. If you are going to discriminate (invidiously) using a computerized database, using H for Handicapped (or G for Gimp) will do just as well. Are you going to complain about the various symbols of religion already encoded on the same grounds? -- All Norstrilians knew what laughter was: it was "pleasurable corrigible malfunction". --Cordwainer Smith, _Norstrilia_ John Cowan http://www.reutershealth.com [EMAIL PROTECTED]
Re: Question about Unicode Ranges in TrueType fonts
On Thursday, June 26, 2003 4:13 PM, Andrew C. West <[EMAIL PROTECTED]> wrote: > On Thu, 26 Jun 2003 14:26:13 +0200, "Philippe Verdy" wrote: > > > Isn't there a work-around with the following function (quote from > > Microsoft MSDN): > > (with the caveat that you first need to allocate and fill a Unicode > > string for the > > codepoints you want to test, and this can be lengthy if one wants > > to retrieve the full list of supported codepoints). > > However, this is still the best function to use to know if a string > > can effectively > > be rendered before drawing it... > > > > _*GetGlyphIndices*_ > > > > GetGlyphIndices() or Uniscribe's ScriptGetCMap() would be OK for > checking coverage for small Unicode blocks such as Gothic (27 > codepoints) or even Mathematical Alphanumeric Symbols (992 > codepoints), but I suspect your application would freeze if you tried > to use it to work out exact codepoint coverage of CJK-B (42,711 > codepoints) and PUA-A and PUA-B (65,534 codepoints each). That's why I added the comment. For an effective application however, this is a great way to check if a given text will be effectively displayed. If not, one can use other Uniscribe functions to perform additional mappings, and if this fails, one can add another TrueType font to a logical font, by selecting among those that have a script bit set in their descriptors. The application may propose to users to select a preferred order for all fonts having a script bit set in this descriptor. Then the application will create a logical font for that script using this preference order. But if there's no font in the collection that contains the glyph, there will be no other choice than displaying the substitution glyph of the first font (such as a rectangle bullet) normally bound to U+FFFD unless the font descriptor specifies a specific glyph. 
Other strategies are for the application to create one logical font per language, if the text to render is labelled (out-of-band) with a language indicator. This gives more coherent results than creating a logical font per supported script, notably on Latin-based languages with many characters such as Vietnamese... So if a markup language specifies a font family, the font stack will include this family on top of the stack, followed by the fonts for the language+script combination, followed by the fonts for a particular script, and followed then by all preferred fonts for any scripts, and finally followed by all other fonts. -- Philippe.
RE: Question about Unicode Ranges in TrueType fonts
Andrew West wrote: > By looping through the "ranges" array it is possible to determine exactly > which > characters in which Unicode blocks a given font covers (as long as your > software > has an array of Unicode blocks and their codepoint ranges). > As long as your software has an up-to-date list of > the > Unicode blocks and their constituent codepoints for the latest version of > Unicode, you will always be able to get up to date information about > Unicode > coverage of a font. > > If you want to determine language coverage for a particular > font, > then all you need to do is define a minimum set of codepoints that must be > covered for a particular block or set of blocks to be considered as > supporting > that language. (Just the little matter of deciding what the minimum set of > codepoints would be for every language that is supported by Unicode ...) > Thanks so much for the detailed reply. It would appear from your answer that even after implementing the algorithm to search the Unicode block coverage of a font, the actual comparison "data", that is which blocks to compare and how many code points, is totally undefined. Is there any kind of standard for defining what codepoints are required to write a given language? This seems like the issue that fontconfig gets around by using all those .orth files which define the codepoints for a given language. But is there any standardized set of language required codepoint definitions that could be used? Anyways, where is the up-to-date list of Unicode blocks to be found? It's odd to think that the old way of using Charset identifiers in fonts worked a lot more cleanly for finding fonts matching a language/language group. I would think this kind of core issue would be addressed more cleanly by the font standard. Thanks for any help. Yours truly, Elisha Berns
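The per-language check Andrew describes reduces to a set-containment test once you have (a) the codepoints a font's cmap actually covers and (b) a minimal required codepoint set per language. A sketch with made-up data for both sides; the two `REQUIRED` entries are illustrative stand-ins, not authoritative orthography definitions:

```python
# Hypothetical minimal codepoint sets per language. Real data would come
# from sources like fontconfig's .orth files or published alphabet lists;
# these two entries are illustrative only.
REQUIRED = {
    "Gothic": {chr(cp) for cp in range(0x10330, 0x1034B)},
    "Irish":  set("aábcdeéfghiílmnoóprstuú"),
}

def supports_language(cmap_codepoints, language):
    """True if every codepoint the language needs is in the font's cmap."""
    need = {ord(ch) for ch in REQUIRED[language]}
    return need <= cmap_codepoints

# Stand-in for the codepoints a font covers (in practice, read from the
# font's cmap table, e.g. via GetGlyphIndices or a font library, rather
# than trusting the OS/2 Unicode range bits alone).
fake_cmap = set(range(0x0000, 0x0250))
print(supports_language(fake_cmap, "Irish"))   # True
print(supports_language(fake_cmap, "Gothic"))  # False
```

As the thread notes, a complete check would also need the punctuation and other non-alphabetic characters a language's orthography conventionally uses, which is exactly the "just the little matter" Andrew flags.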
Re: Revised N2586R
At 12:09 -0500 2003-06-26, [EMAIL PROTECTED] wrote: The only meaning that the Standard implies is that the character encoded at codepoint x represents the symbol of a wheelchair. It does not imply *anything* about how its usage in juxtaposition with the name of a person should be interpreted. Indeed William's argument that "HANDICAPPED" is somehow inappropriate just doesn't wash. In Europe at least, many handicapped people consider it far more polite to be called handicapped or behindert or what have you than to be subject to such politically "correct" monstrosities as "differently abled". Which is not to say that the Name Police won't prefer WHEELCHAIR SYMBOL. Time will tell. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Revised N2586R
At 13:03 +0100 2003-06-26, William Overington wrote: Well, certainly authority would be needed, yet I am suggesting that where a few characters added into an established block are accepted, which is what is claimed for these characters, there should be a faster route than having to wait for bulk release in Unicode 4.1. No, there shouldn't. The process will not be changed. Unicode and ISO/IEC 10646 are synchronized, and JTC1 balloting processes are what they are. No further discussion is necessary, as it is pointless. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Major Defects in Subject Lines!
Wow... How on earth did the subject line "Major Defect in Combining Classes of Tibetan Vowels" turn into a discussion of Biblical Hebrew? At least, people, if you're going to transmogrify the discussion, please use a subject line such as "Biblical Hebrew" which someone already was wise enough to start using on some pieces of this thread. Thanks, Rick (All my own opinions, of course)
Re: Revised N2586R
William Overington wrote on 06/26/2003 07:03:12 AM: > yet I am suggesting that where a > few characters added into an established block are accepted, which is what > is claimed for these characters, there should be a faster route than having > to wait for bulk release in Unicode 4.1. Once both UTC and WG2 have approved the assignment of characters to particular codepoints, I might risk making fonts using those codepoints for those characters, as it's not very likely the codepoints will be changed at that point. There's no guarantee that would not happen, however, so I certainly wouldn't distribute such fonts if I were a commercial foundry -- too much at stake. If an amendment to ISO 10646 gets published prior to a new version of Unicode, though, that would constitute a guarantee the codepoints will not change. > If these characters have been > accepted, why not formally warrant their use now by having Unicode 4.001 > and then having Unicode 4.002 when a few more are accepted? That is not how versioning is done with the standard. Please read http://www.unicode.org/standard/versions/ > Some fontmakers can react to new > releases more quickly than can some other fontmakers, so why should progress > be slowed down for the benefit of those who cannot add new glyphs into fonts > quickly? Fontmakers don't need to wait until a new version is published before they start preparing fonts. > For example, symbols for audio description, subtitles and signing are needed > for broadcasting. Will that need to have years of waiting and using the > Private Use Area when it could be a fairly swift process and the characters > could be implemented into read-only memories in interactive television sets > that much sooner? Well, if the characters haven't even been proposed for addition to the standard, then yes, it will take years of PUA usage. 
> Why is it that it is regarded by the Unicode Consortium as reasonable that it takes years to get a character through the committees and into use?

Because there is a process that takes time. International standards aren't created by a few people working out of their garage. Some international standards take far longer than do updates to Unicode.

> Surely where a few characters are needed the Unicode Consortium and ISO need to take a twenty-first century attitude to getting the job done

It might be a good idea to become more familiar with the actual process and work on international standards in general before criticizing the people doing the work. There are a number of people working quite hard on this stuff, with their time being volunteered by the organizations and companies they represent, or from their own personal time.

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
Re: Revised N2586R
William Overington wrote on 06/26/2003 06:24:44 AM:

> > the name is simply a unique identifier within the std.
>
> Well, the Standard is the authority for what is the meaning of the symbol when found in a file of plain text. So if the symbol is in a plain text file before or after the name of a person then the Standard implies a meaning to the plain text file.

The only meaning that the Standard implies is that the character encoded at codepoint x represents the symbol of a wheelchair. It does not imply *anything* about how its usage in juxtaposition with the name of a person should be interpreted.

- Peter
RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
Jony Rosenne wrote on 06/26/2003 06:26:02 AM:

> It may look silly, but it is correct. What you see are letters according to the writing tradition, which does not include a Yod, and vowels according to the reading tradition, which does.

I understand that. My point was, you were talking about phonology, but in terms of the text, it was not correct: there *are* multiple vowels on a single consonant.

> There are in the Bible other, more extreme cases.

I'd be interested in whatever info you can provide in that regard.

> I don't think we need any new characters, ZERO WIDTH SPACE would do and it requires no new semantics.

No, that's a terrible solution: a space creates unwanted word boundaries.

> Moreover, everybody who knows his Hebrew Bible knows the Yod is there although it isn't written.

But the point is, how do people encode the text? The yod is not there in the text. How does a publisher encode text in the typesetting process? How do researchers encode the text they want to analyze? Saying "everybody knows there's a yod there" doesn't provide a solution, particularly given that the researchers know in point of fact that the consonantal text explicitly does not include a yod.

> The Meteg is a completely different issue. There is a small number of places where the Meteg is placed differently. Since it does not behave the same as the regular Meteg, and is thus visually distinguishable, it should be possible to add a character, as long as it is clearly named.

That is a potential solution, though it would have to be *two* additional metegs.

- Peter
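Constable's point about multiple written vowels can be checked directly against the Unicode Character Database: because the Hebrew points carry distinct canonical combining classes, normalization reorders them regardless of the order the author typed. A quick sketch using Python's `unicodedata` module (the alef-patah-hiriq sequence is illustrative, not taken from the thread):

```python
import unicodedata

# Hebrew points have distinct canonical combining classes, so canonical
# ordering sorts them by class, not by the order the author typed them.
alef, patah, hiriq = "\u05D0", "\u05B7", "\u05B4"   # ccc: 0, 17, 14

typed = alef + patah + hiriq          # author intends patah first
normalized = unicodedata.normalize("NFC", typed)

# Normalization silently swaps the vowels (hiriq's class 14 < patah's 17).
print(normalized == alef + hiriq + patah)  # True
```

This is exactly the behaviour that makes the fixed combining classes a plain-text problem rather than a rendering problem.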
RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
At 04:26 AM 6/26/2003, Jony Rosenne wrote: I don't think we need any new characters, ZERO WIDTH SPACE would do and it requires no new semantics. ZERO WIDTH SPACE would screw up search and sort algorithms, I think, because it is not a control character per se and may not be ignored as desired. I've made some tests using Ken's ZWJ suggestion and, as feared, it messes with the glyph positioning lookups. The results varied slightly between MS RichText clients and InDesign ME, but both displayed marks incorrectly when ZWJ was inserted. I strongly suspect that this is not something that can easily be resolved in the glyph shaping model. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
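For what it's worth, the reason ZWJ (or ZWSP) "works" at the encoding level is that both have combining class 0, which blocks canonical reordering across them; the display and word-boundary side effects Hudson describes are separate issues that this small Python check cannot capture:

```python
import unicodedata

alef, patah, hiriq = "\u05D0", "\u05B7", "\u05B4"
zwj = "\u200D"   # ZERO WIDTH JOINER, combining class 0

# Without a separator the marks reorder; a class-0 character between them
# splits the combining sequence, so each mark stays where it was typed.
plain = unicodedata.normalize("NFC", alef + patah + hiriq)
joined = unicodedata.normalize("NFC", alef + patah + zwj + hiriq)

print(plain == alef + hiriq + patah)           # True: reordered
print(joined == alef + patah + zwj + hiriq)    # True: order preserved
```

So the proposal is sound as far as normalization goes; the trouble Hudson reports is entirely in the glyph-shaping layer.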
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 12:43 AM 6/26/2003, [EMAIL PROTECTED] wrote:

> The problem of combinations of vowels with meteg could be amenable to a similar approach. OR, one could propose just one additional meteq/silluq character, to make it possible to distinguish (in plain text) instances of left-side and right-side meteq placement, for example.

And the third position of meteg with hataf vowels? Introduce *two* additional meteg/silluq characters? No, that's a glyph ligation matter however you look at it. It could be made to work with either just a left meteg or also with a new right meteg, and can be inhibited with ZWNJ. This is not to say that I think encoding a distinct right meteg character is the best solution, only that it doesn't affect the medial meteg shaping.

John Hudson
Tiro Typeworks
www.tiro.com
Vancouver, BC
[EMAIL PROTECTED]
Re: Revised N2586R
William Overington wrote:

> Well, certainly authority would be needed, yet I am suggesting that where a few characters added into an established block are accepted, which is what is claimed for these characters, there should be a faster route than having to wait for bulk release in Unicode 4.1. If these characters have been accepted, why not formally warrant their use now by having Unicode 4.001 and then having Unicode 4.002 when a few more are accepted? These minor additions to the Standard could be produced as characters are accepted and publicised in the Unicode Consortium's webspace. If the characters have not been accepted then they cannot be considered ready to be used, yet if they have been accepted, what is the problem in releasing them so that people who want to get on with using them can do so? Some fontmakers can react to new releases more quickly than can some other fontmakers, so why should progress be slowed down for the benefit of those who cannot add new glyphs into fonts quickly?

That's just the way standards work. You have to wait until final, FINAL approval and official release before you can do newly approved things conformantly. There has to be a chance for the authority at the very end of the process to say, "Wait a minute, I see a problem, this can't go out like this." Dealing with a problem that slipped through because the process was "fast-tracked" or sidestepped is much more expensive than waiting for the process to run its course. This is not "a nonsense"; it makes a lot of sense to anyone who's seen what can happen when process is ignored.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
Re: Question about Unicode Ranges in TrueType fonts
On Thu, 26 Jun 2003 14:26:13 +0200, "Philippe Verdy" wrote:

> Isn't there a work-around with the following function (quote from Microsoft MSDN)? (With the caveat that you first need to allocate and fill a Unicode string for the codepoints you want to test, and this can be lengthy if one wants to retrieve the full list of supported codepoints.) However, this is still the best function to use to know if a string can effectively be rendered before drawing it...
>
> _*GetGlyphIndices*_

GetGlyphIndices() or Uniscribe's ScriptGetCMap() would be OK for checking coverage for small Unicode blocks such as Gothic (27 codepoints) or even Mathematical Alphanumeric Symbols (992 codepoints), but I suspect your application would freeze if you tried to use it to work out exact codepoint coverage of CJK-B (42,711 codepoints) and PUA-A and PUA-B (65,534 codepoints each).

Andrew
Re: Question about Unicode Ranges in TrueType fonts
On Thursday, June 26, 2003 2:26 PM, Philippe Verdy <[EMAIL PROTECTED]> wrote:

I forgot also the probably better function from the Uniscribe library, which processes strings through a language-dependent shaping algorithm, and can determine appropriate glyph substitution, or use custom composite fonts to process character clusters into grapheme clusters with 1-to-1, 1-to-N, N-to-1, or N-to-M substitutions, using either the "cmap" table of classic TrueType fonts (which do not support characters out of the BMP), or the new tables added in OpenType fonts.

-- Philippe.

source: Microsoft MSDN:

*ScriptGetCMap*

The *ScriptGetCMap* function takes a string and returns the glyph indices of the Unicode characters according to the TrueType cmap table or the standard cmap table implemented for old style fonts.

HRESULT WINAPI ScriptGetCMap(
  HDC hdc,
  SCRIPT_CACHE *psc,
  const WCHAR *pwcInChars,
  int cChars,
  DWORD dwFlags,
  WORD *pwOutGlyphs
);

*Parameters*
/hdc/ [in] Handle to the device context. This parameter is optional.
/psc/ [in/out] Pointer to a SCRIPT_CACHE structure.
/pwcInChars/ [in] Pointer to a string of Unicode characters.
/cChars/ [in] Number of Unicode characters in pwcInChars.
/dwFlags/ [in] Flag that specifies any special handling of the glyphs. By default, the glyphs of the buffer are given in logical order with no special handling. This parameter can be the following value.
- SGCM_RTL - Indicates the glyph array pwOutGlyphs should contain mirrored glyphs for those glyphs that have a mirrored equivalent.
/pwOutGlyphs/ [out] Pointer to an array that receives the glyph indices.

*Return Values*
If all Unicode code points are present in the font, the return value is S_OK. If the function fails, it may return one of the following nonzero values.
- E_HANDLE - The font or the system does not support glyph indices.
- S_FALSE - Some of the Unicode code points were mapped to the default glyph.
If any other unrecoverable error is encountered, it is returned as an HRESULT.

*Remarks*
ScriptGetCMap may be used to determine which characters in a run are supported by the selected font. The caller may scan the returned glyph buffer looking for the default glyph to determine which characters are not available. The default glyph index for the selected font should be determined by calling ScriptGetFontProperties. The return value indicates the presence of any missing glyphs. Note that some code points can be rendered by a combination of glyphs as well as by a single glyph -- for example, 00C9 LATIN CAPITAL LETTER E WITH ACUTE. In this case, if the font supports the capital E glyph and the acute glyph but not a single glyph for 00C9, ScriptGetCMap will show 00C9 is unsupported. To determine the font support for a string that contains these kinds of code points, call ScriptShape. If it returns S_OK, check the output for missing glyphs.

*Requirements*
- Windows NT/2000/XP: Included in Windows 2000 and later.
- Redistributable: Requires Internet Explorer 5 or later on Windows 95/98/Me.
- Header: Declared in Usp10.h.
- Library: Use Usp10.lib.

*See Also*
Uniscribe Overview, Uniscribe Functions, ScriptGetFontProperties, ScriptShape, SCRIPT_CACHE
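The 00C9 remark in the MSDN excerpt is canonical equivalence at work: the same character can arrive precomposed or decomposed, and a cmap-only check sees only the precomposed code point. A quick illustration with Python's `unicodedata` (standing in for Uniscribe, which of course works at the glyph level):

```python
import unicodedata

# U+00C9 may be rendered from one precomposed glyph or from E plus a
# combining acute; the two spellings are canonically equivalent, which is
# why a per-codepoint cmap lookup understates what a font can render.
composed = "\u00C9"
decomposed = unicodedata.normalize("NFD", composed)

print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0045', 'U+0301']
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```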
Re: Question about Unicode Ranges in TrueType fonts
On Thursday, June 26, 2003 11:50 AM, Andrew C. West <[EMAIL PROTECTED]> wrote:

> On Wed, 25 Jun 2003 21:58:28 -0700, "Elisha Berns" wrote:
>
> > Some weeks back there were a number of postings about software for viewing Unicode Ranges in TrueType fonts and I had a few questions about that. Most viewers listed seemed to only check the Unicode Range bits of the fonts which can be misleading in certain cases.
>
> Now the caveat. The USB sets a Surrogates bit to indicate that the font contains at least one codepoint beyond the Basic Multilingual Plane (BMP). Unfortunately the "ranges" array of the GLYPHSET structure only lists contiguous clumps of Unicode codepoints within the BMP (wcLow is a 16 bit value), and does not list surrogate coverage. Therefore you cannot determine supra-BMP codepoint coverage from the GLYPHSET structure. If anyone does know an easy way to do this under Windows, please let me know.

Isn't there a work-around with the following function (quote from Microsoft MSDN)? (With the caveat that you first need to allocate and fill a Unicode string for the codepoints you want to test, and this can be lengthy if one wants to retrieve the full list of supported codepoints.) However, this is still the best function to use to know if a string can effectively be rendered before drawing it...

-- Philippe.

_*GetGlyphIndices*_

The *GetGlyphIndices* function translates a string into an array of glyph indices. The function can be used to determine whether a glyph exists in a font.

DWORD GetGlyphIndices(
  HDC hdc,       // handle to DC
  LPCTSTR lpstr, // string to convert
  int c,         // number of characters in string
  LPWORD pgi,    // array of glyph indices
  DWORD fl       // glyph options
);

_Parameters_
/hdc/ [in] Handle to the device context.
/lpstr/ [in] Pointer to the string to be converted.
/c/ [in] Length of the string in lpstr. For the ANSI function it is a BYTE count and for the Unicode function it is a WORD count.
Note that for the ANSI function, characters in SBCS code pages take one byte each, while most characters in DBCS code pages take two bytes; for the Unicode function, most currently defined Unicode characters (those in the Basic Multilingual Plane (BMP)) are one WORD while Unicode surrogates are two WORDs.

/pgi/ [out] Array of glyph indices corresponding to the characters in the string.
/fl/ [in] Specifies how glyphs should be handled if they are not supported. This parameter can be the following value.
- GGI_MARK_NONEXISTING_GLYPHS - Marks unsupported glyphs with the hexadecimal value 0xFFFF.

_Return Values_
If the function succeeds, it returns the number of bytes (for the ANSI function) or WORDs (for the Unicode function) converted. If the function fails, the return value is GDI_ERROR. *Windows NT/2000/XP*: To get extended error information, call *GetLastError*.

_Requirements_
- Windows NT/2000/XP: Included in Windows 2000 and later.
- Windows 95/98/Me: Unsupported.
- Header: Declared in Wingdi.h; include Windows.h.
- Library: Use Gdi32.lib.
- Unicode: Implemented as Unicode and ANSI versions.
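On the "surrogates are two WORDs" point: the way a supplementary-plane character becomes a surrogate pair in the UTF-16 strings these functions consume can be sketched in a few lines (Python here, purely for illustration):

```python
import struct

# U+20000 (the first character of CJK Extension B) sits above the BMP, so
# in UTF-16 it occupies two WORDs: a high surrogate then a low surrogate.
ch = "\U00020000"
hi, lo = struct.unpack("<2H", ch.encode("utf-16-le"))

print(f"0x{hi:04X} 0x{lo:04X}")  # 0xD840 0xDC00
print(0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF)  # True
```

This is why a 16-bit wcLow field in WCRANGE cannot describe supra-BMP coverage: the code point simply does not fit in one WORD.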
Re: Nightmares
Tom Gewecke wrote as follows.

> My personal idea of an Orwellian nightmare would be to have a committee of "vigilant freedom protectors" evaluating the "political and social implications of encoding symbols" and passing judgement on whether particular characters should be encoded and what their names should not be.

Yes, I agree that would be terrible. Your personal idea of an Orwellian nightmare differs greatly from what I am suggesting should take place. I am suggesting that everybody, as part of their activity in character encoding, be vigilant that what is encoded does not provide an infrastructure for an Orwellian nightmare to take place with computing systems such as databases. The difference is like that between a country having a special "riot police" force and having regular police who wear riot gear when the need arises. This distinction was stressed when police in riot gear were first seen on the streets in England, as the television news began by using the term "riot police". So I am not suggesting such a committee, just that ordinary people who encode characters be vigilant about the political and social implications of what they are doing, lest, by not concerning themselves with such an important aspect of their work, namely the potential for causing misery, the opportunity for such misery to occur is unthinkingly provided, or is not prevented when it easily could be. Hopefully this will clarify my thinking to you and be of interest to people involved in character encoding discussions. One of the great issues of the last century was whether scientists should consider the political and social implications of their work, or work as if somehow separate from society and leave the application of the things they discovered and developed to politicians and business people. This issue has arisen because of my concern that a particular symbol has been labelled HANDICAPPED SIGN.
I hope that the name will be changed to WHEELCHAIR SYMBOL. Yet what if my concerns over the need for vigilance were to be dismissed? What characters might be encoded in the future, with what names? After all, if no one is willing to be vigilant because that very vigilance is regarded as an Orwellian nightmare, there would then be no constraints. I am very much someone who believes in the need for checks and balances. I feel that we need checks and balances in what is encoded and what names are applied to symbols. I also feel that we need checks and balances as to how those checks and balances are carried out.

William Overington
26 June 2003
Re: Revised N2586R
Peter Constable wrote as follows.

> the name is simply a unique identifier within the std.

Well, the Standard is the authority for the meaning of the symbol when found in a file of plain text. So if the symbol is in a plain text file before or after the name of a person, then the Standard implies a meaning to the plain text file.

> A name may be somewhat indicative of its function, but is not necessarily so.

Well, that could ultimately be an issue before the courts in a libel case if someone publishes a text with a symbol next to someone's name. A key issue might well be what the defined meaning of the symbol is in the Standard. Certainly, the issue of what a reasonable person seeing that symbol next to someone's name might conclude is being published about the person might well also be important, even if that meaning is not in the Standard.

> You could call it WHEELCHAIR SYMBOL, but that engineering of the standard is not also social engineering, and people may still use it to label individuals in a way that may be violating human rights -- we cannot stop that. No matter what we call it, end users are not very likely going to be aware of the name in the standard; they're just going to look for the shape, and if they find it, they'll use it for whatever purpose they choose.

Certainly. Yet a plain text interchangeable file would not have the meaning built into it by the Standard. I agree, though, that there may well still be great problems.

William Overington
26 June 2003
Re: Revised N2586R
Michael Everson wrote as follows.

> At 08:44 -0700 2003-06-25, Doug Ewell wrote:
>
> > If it's true that either the UTC or WG2 has formally approved the character, for a future version of Unicode or a future amendment to 10646, then I don't see any reason why font makers can't PRODUCE a font with a glyph for the proposed character at the proposed code point. They just can't DISTRIBUTE the font until the appropriate standard is released.
>
> That's correct.

Well, certainly authority would be needed, yet I am suggesting that where a few characters added into an established block are accepted, which is what is claimed for these characters, there should be a faster route than having to wait for bulk release in Unicode 4.1. If these characters have been accepted, why not formally warrant their use now by having Unicode 4.001 and then having Unicode 4.002 when a few more are accepted? These minor additions to the Standard could be produced as characters are accepted and publicised in the Unicode Consortium's webspace. If the characters have not been accepted then they cannot be considered ready to be used, yet if they have been accepted, what is the problem in releasing them so that people who want to get on with using them can do so? Some fontmakers can react to new releases more quickly than can some other fontmakers, so why should progress be slowed down for the benefit of those who cannot add new glyphs into fonts quickly? For example, symbols for audio description, subtitles and signing are needed for broadcasting. Will that need years of waiting and of using the Private Use Area, when it could be a fairly swift process and the characters could be implemented into read-only memories in interactive television sets that much sooner? Why is it regarded by the Unicode Consortium as reasonable that it takes years to get a character through the committees and into use? Surely where a few characters are needed the Unicode Consortium and ISO need to take a twenty-first century attitude to getting the job done for people's needs rather than having the sort of delays which might have been acceptable in days gone by. The idea of having to use the Private Use Area for a period after the characters have been accepted is just a nonsense.

William Overington
26 June 2003
RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
It may look silly, but it is correct. What you see are letters according to the writing tradition, which does not include a Yod, and vowels according to the reading tradition, which does. There are in the Bible other, more extreme cases.

I don't think we need any new characters, ZERO WIDTH SPACE would do and it requires no new semantics. Moreover, everybody who knows his Hebrew Bible knows the Yod is there although it isn't written.

The Meteg is a completely different issue. There is a small number of places where the Meteg is placed differently. Since it does not behave the same as the regular Meteg, and is thus visually distinguishable, it should be possible to add a character, as long as it is clearly named.

Jony

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
> Sent: Thursday, June 26, 2003 9:43 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
>
> Jony Rosenne wrote on 06/26/2003 12:16:22 AM:
>
> > When, in the Bible, one sees two vowels on a given consonant, it isn't so.
>
> That's silly. When one sees two vowels on a given consonant in the Bible, it *is* so: the two vowels are written there. It may not correspond to actual phonology, i.e. what is spoken, but as has been made clear on many occasions, Unicode is not encoding phonology, it is encoding text. And in relation to text, your statement is simply wrong.
>
> > There is one vowel for the consonant one sees, and another vowel for an invisible consonant. The proper way to encode it is to use some code to represent the invisible consonant. Then the problem mentioned below does not arise.
>
> The idea of an invisible consonant would amount to encoding a phonological entity, which is the kind of thing that was at one time approved for Khmer (invisible characters representing inherent vowels), but later turned into an albatross, and when I proposed the same thing (invisible inherent vowel) for Syloti Nagri, it was made very clear to me that it would not go down well with UTC.
>
> Also, the proposed solution of an invisible consonant would leave unresolved the problem of meteg-vowel ordering distinctions, while the alternate proposal of having meteg and vowels all with a class of 230 solves both problems at once. Two ad hoc solutions (one for multi-vowel ordering, and another for meteg-vowel ordering) must certainly be far less preferred than one motivated solution (having characters with canonical combining classes that are appropriate for the writing behaviours exhibited).
>
> I invite people to review the discussions from the unicoRe list from last December, at which time everyone (including you, Jony) was concluding that the solution which I proposed in L2/03-195 was the best solution to pursue.
>
> - Peter
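The meteg-vowel ordering loss referred to in the quoted message is easy to reproduce. A minimal sketch with Python's `unicodedata` (the alef-hiriq-meteg combination is illustrative):

```python
import unicodedata

alef, hiriq, meteg = "\u05D0", "\u05B4", "\u05BD"   # ccc: 0, 14, 22

# Meteg's combining class (22) outranks the vowel's (14), so BOTH typed
# orders canonically reorder to vowel-then-meteg: the plain-text
# distinction between the two placements does not survive normalization.
a = unicodedata.normalize("NFC", alef + meteg + hiriq)
b = unicodedata.normalize("NFC", alef + hiriq + meteg)

print(a == b)  # True: meteg-before-vowel and vowel-before-meteg collapse
```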
Re: Question about Unicode Ranges in TrueType fonts
On Wed, 25 Jun 2003 21:58:28 -0700, "Elisha Berns" wrote:

> Some weeks back there were a number of postings about software for viewing Unicode Ranges in TrueType fonts and I had a few questions about that. Most viewers listed seemed to only check the Unicode Range bits of the fonts which can be misleading in certain cases.

For W2K and XP only, Microsoft provides an API for determining exactly which Unicode codepoints a font covers. GetFontUnicodeRanges() in the Platform SDK fills a GLYPHSET structure with Unicode coverage information for the currently selected font in a given device context. The GLYPHSET structure has these members:

cGlyphsSupported - Total number of Unicode code points supported in the font
cRanges - Total number of Unicode ranges in ranges
ranges - Array of Unicode ranges that are supported in the font

Note that "cRanges" is not the number of Unicode blocks supported, and "ranges" is not an array of Unicode blocks. Rather, "ranges" is an array of WCRANGE structures that specify contiguous clumps of Unicode codepoints, and "cRanges" is the number of contiguous clumps of Unicode codepoints. The WCRANGE structure has the following members:

wcLow - Low Unicode code point in the range of supported Unicode code points
cGlyphs - Number of supported Unicode code points in this range

By looping through the "ranges" array it is possible to determine exactly which characters in which Unicode blocks a given font covers (as long as your software has an array of Unicode blocks and their codepoint ranges). Note that unlike the Unicode Subset Bitfields (USB) that are part of the FONTSIGNATURE structure filled by GetTextCharsetInfo() etc. (available to W9X and NT as well as 2K/XP), which are limited to a particular version of Unicode (3.0?) and cover supersets of Unicode blocks, the GLYPHSET structure is version-independent.
As long as your software has an up-to-date list of the Unicode blocks and their constituent codepoints for the latest version of Unicode, you will always be able to get up to date information about Unicode coverage of a font. This is the method used in my BabelMap utility, and you will note that it is therefore able to not only list what Unicode 4.0 blocks are covered by a particular font, but also give the exact number of codepoints that are covered in that block. If you want to determine language coverage for a particular font, then all you need to do is define a minimum set of codepoints that must be covered for a particular block or set of blocks to be considered as supporting that language. (Just the little matter of deciding what the minimum set of codepoints would be for every language that is supported by Unicode ...) Now the caveat. The USB sets a Surrogates bit to indicate that the font contains at least one codepoint beyond the Basic Multilingual Plane (BMP). Unfortunately the "ranges" array of the GLYPHSET structure only lists contiguous clumps of Unicode codepoints within the BMP (wcLow is a 16 bit value), and does not list surrogate coverage. Therefore you cannot determine supra-BMP codepoint coverage from the GLYPHSET structure. If anyone does know an easy way to do this under Windows, please let me know. Regards, Andrew
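The loop over the "ranges" array that Andrew describes is straightforward; here is a sketch of just that logic in Python, with invented (wcLow, cGlyphs) pairs standing in for a real GLYPHSET (the structure itself is only obtainable through the Win32 API):

```python
# Hypothetical GLYPHSET-style data: each pair is (wcLow, cGlyphs), i.e. a
# contiguous clump of supported code points, as GetFontUnicodeRanges reports.
ranges = [(0x0041, 26), (0x05D0, 27)]   # invented coverage: A-Z, alef-tav

def covered_codepoints(ranges):
    """Expand (wcLow, cGlyphs) clumps into the full set of code points."""
    pts = set()
    for wc_low, c_glyphs in ranges:
        pts.update(range(wc_low, wc_low + c_glyphs))
    return pts

pts = covered_codepoints(ranges)

# Intersect with a block definition to report per-block coverage counts.
hebrew_letters = set(range(0x05D0, 0x05EB))   # U+05D0..U+05EA
print(len(pts & hebrew_letters))  # 27: the whole letter range is covered
```

The same intersection against a table of block boundaries gives the per-block counts that a tool like BabelMap reports.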
IUC23 Unicode conference exhibitors' panel report
Hi, For those of you that couldn't attend and were interested in the exhibitor's panel at the last Unicode conference, a brief summary is now online at: http://www.unicode.org/iuc/iuc23/showcase-report.html If you have any comments or feedback on the page, I would be glad to receive it off-list. tex
Re: Equivalence of some Japanese characters in Unicode
Sourav, Hi, your question is ambiguous to me. You seem to be referring to the fullwidth space and other "wide" or "fullwidth" characters. For the fullwidth space, look at U+3000 IDEOGRAPHIC SPACE. Unicode has other fullwidth characters encoded. Look at the code charts...

hth
tex

souravm wrote:
>
> Hi All,
>
> I have a doubt regarding existence of certain Japanese characters in Unicode.
>
> The characters I'm referring to are those like "double byte space" which one can get from old NEC machines or can be entered through a Japanese keyboard only.
>
> Can anyone please throw some light on this?
>
> Regards,
> Sourav

--
Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED]
Xen Master http://www.i18nGuy.com
XenCraft http://www.XenCraft.com
Making e-Business Work Around the World
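Tex's pointer can be made concrete: U+3000 and the other fullwidth forms carry compatibility decompositions, so NFKC normalization folds them to their ASCII counterparts. A quick check in Python (illustrative, not part of the original exchange):

```python
import unicodedata

# The "double-byte space" is U+3000 IDEOGRAPHIC SPACE; fullwidth letters
# such as U+FF21/U+FF41 are compatibility variants of A and a.
print(unicodedata.name("\u3000"))                        # IDEOGRAPHIC SPACE
print(unicodedata.normalize("NFKC", "\u3000") == " ")    # True
print(unicodedata.normalize("NFKC", "\uFF21\uFF41") == "Aa")  # True
```

Note that this folding is a compatibility mapping, not canonical equivalence: the fullwidth characters remain distinct code points under NFC.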
24th Unicode Conference - Atlanta, GA - September 3-5, 2003
Twenty-fourth Internationalization and Unicode Conference (IUC24)
Unicode, Internationalization, the Web: Powering Global Business
http://www.unicode.org/iuc/iuc24
September 3-5, 2003, Atlanta, GA

NEWS
> Visit the Conference Web site ( http://www.unicode.org/iuc/iuc24 ) to check the updated Conference program and register. To help you choose Conference sessions, we've included abstracts of talks and speakers' biographies.
> Hotel guest room group rate valid to August 12.
> Early bird registration rates valid to August 12.
> To find out about, and register for, the TILP Breakfast Meeting and Roundtable, organized by The Institute of Localisation Professionals and taking place at the same venue on September 4, 7:00 a.m.-9:00 a.m., see: http://www.tilponline.org/events/diary.shtml or http://www.unicode.org/iuc/iuc24

Are you falling behind? Version 4.0 of the Unicode Standard is here! Software and Web applications can now support more languages with greater efficiency and lower cost. Do you need to find out how? Do you need to be more competitive around the globe? Is your software upward-compatible with version 4.0? Does your staff need internationalization training? Learn about software and Web internationalization and the new Unicode Standard, including its latest features and requirements. This is the only event endorsed by the Unicode Consortium. The conference will be held September 3-5, 2003 in Atlanta, Georgia, and is completely updated.

KEYNOTES: The keynote speakers for IUC24 are well-known authors in the internationalization and localization industries: Donald De Palma, President, Common Sense Advisory, Inc., author of "Business Without Borders: A Strategic Guide to Global Marketing", and Richard Gillam, author of "Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard" and a former columnist for "C++ Report".

TUTORIALS: The redeveloped and enhanced Unicode 4.0 tutorial is taught by Dr. Asmus Freytag, one of the major contributors to the standard, who has extensive experience implementing real-world Unicode applications. Structured into three independent modules, it lets you attend just the overview, or only the most advanced material. Tutorials in Web internationalization, non-Latin scripts, and more are offered in parallel and taught by recognized industry experts.

CONFERENCE TRACKS: Gain the competitive edge! Conference sessions provide the most up-to-date technical information on standards, best practices, and recent advances in the globalization of software and the Internet. Panel discussions and the friendly atmosphere allow you to exchange ideas and ask questions of key players in the internationalization industry.

WHO SHOULD ATTEND?: If you have a limited training budget, this is the one internationalization conference you need. Send staff who are involved in either Unicode-enabling software or internationalization of software and the Internet, including: managers, software engineers, systems analysts, font designers, graphic designers, content developers, Web designers, Web administrators, technical writers, and product marketing personnel.

CONFERENCE WEB SITE, PROGRAM and REGISTRATION
The Conference Program and Registration form are available at the Conference Web site: http://www.unicode.org/iuc/iuc24

CONFERENCE SPONSORS
Agfa Monotype Corporation
Basis Technology Corporation
ClientSide News L.L.C.
Oracle Corporation
World Wide Web Consortium (W3C)
XenCraft

GLOBAL COMPUTING SHOWCASE
Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation, and Internet content. Sign up for the Exhibitors' track as part of the Conference. For more information, please see: http://www.unicode.org/iuc/iuc24/showcase.html

CONFERENCE VENUE
The Conference will take place at:
DoubleTree Hotel Atlanta Buckhead
3342 Peachtree Road
Atlanta, GA 30326
Tel: +1-404-231-1234
Fax: +1-404-231-3112

CONFERENCE MANAGEMENT
Global Meeting Services Inc.
8949 Lombard Place, #416
San Diego, CA 92122, USA
Tel: +1 858 638 0206 (voice)
+1 858 638 0504 (fax)
Email: [EMAIL PROTECTED] or: [EMAIL PROTECTED]

THE UNICODE CONSORTIUM
The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of the Unicode Standard, a worldwide character encoding. The Unicode Standard encodes
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Karljürgen Feuerherm wrote on 06/25/2003 08:31:41 PM:

> I was going to suggest something very similar, a ZW-pseudo-consonant of some
> kind, which would force each vowel to be associated with one consonant.

An invisible *consonant* doesn't make sense, because the problem involves more than just multiple written vowels on one consonant; in fact, that is a small portion of the general problem. If we want such a character, it would notionally be a zero-width canonical-ordering inhibiter, and nothing more. And I don't particularly want to think about what happens when people start sticking this thing into sequences other than Biblical Hebrew ("in Unicode, any sequence is legal").

> General question: when does canonical reordering take place? At input time,
> at rendering time, at another time?

For the purpose for which canonical ordering was intended, it occurs when comparing two strings for "equality" or ordering. In practice, it can occur at *any* time, including transmission (when it is no longer under the control of the author). Some protocols, notably W3C protocols, require that data be canonically ordered, and recommend that this happen at the earliest point possible, e.g. at input time.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
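Canonical reordering is easy to observe directly. Here is a minimal Python sketch (mine, not from the thread) using the standard `unicodedata` module: the current combining classes are hiriq = 14 and patah = 17, so normalization moves hiriq in front of patah regardless of the order the author typed.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED, ccc = 0
PATAH = "\u05B7"   # HEBREW POINT PATAH, ccc = 17
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ, ccc = 14

typed = LAMED + PATAH + HIRIQ            # author's order: patah first
normalized = unicodedata.normalize("NFC", typed)

# Canonical reordering sorts the run of marks by combining class,
# so hiriq (14) is moved before patah (17).
assert normalized == LAMED + HIRIQ + PATAH
assert unicodedata.combining(PATAH) == 17
assert unicodedata.combining(HIRIQ) == 14
```

Any normalizing process in the pipeline, not just an explicit comparison, can perform this reordering.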
Re: Major Defect in Combining Classes of Tibetan Vowels
John Hudson wrote on 06/25/2003 06:47:44 PM:

> >This is not. The Unicode Standard makes no assumptions or claims
> >about what the phonological or meaning equivalence of <hiriq, patah>
> >or <patah, hiriq> is for Biblical Hebrew.
>
> But it does make assumptions about the canonical equivalence of the mark
> orders <hiriq, patah> and <patah, hiriq>, unless my understanding of
> the purpose of combining classes is completely mistaken.

Your understanding on this point is correct.

> My understanding
> is that any ordering of two marks with different combining classes is
> canonically equivalent;

Yes.

> further, I understand that some normalisation forms
> will re-order marks to move marks with lower combining class values closer
> to the base character.

*Every* Unicode normalization form will apply canonical reordering.

> * Meteg re-ordering is in some respects even more problematic than
> multi-vowel re-ordering

And it is because of meteg-vowel ordering distinctions that the ordering of things like patah + hiriq should not be solved in any way other than giving the two the same canonical combining class, because that is exactly what will be needed to deal with meteg-vowel ordering distinctions.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
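The meteg problem is just as easy to demonstrate. In a small Python sketch (mine, illustrative only), meteg (U+05BD, ccc = 22) always sorts after a vowel such as patah (ccc = 17) under normalization, so a placement distinction encoded by mark order cannot survive:

```python
import unicodedata

BET   = "\u05D1"   # HEBREW LETTER BET
PATAH = "\u05B7"   # HEBREW POINT PATAH, ccc = 17
METEG = "\u05BD"   # HEBREW POINT METEG, ccc = 22

meteg_first = BET + METEG + PATAH   # meteg encoded before the vowel
vowel_first = BET + PATAH + METEG   # meteg encoded after the vowel

# Both orders normalize to the same string (17 sorts before 22),
# so the encoded distinction is neutralized.
assert (unicodedata.normalize("NFD", meteg_first)
        == unicodedata.normalize("NFD", vowel_first))
```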
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Ken Whistler wrote on 06/25/2003 06:57:56 PM:

> People could consider, for example, representation
> of the required sequence:
>
> <qamets, hiriq>
>
> as:
>
> <qamets, ZWJ, hiriq>

So, we want to introduce yet *another* distinct semantic for ZWJ? We've got one for Indic, another for Arabic, another for ligatures (similar to that for Arabic, but slightly different). Now another that is "don't effect any visual change, just be there to inhibit reordering under canonical ordering / normalization"?

> The presence of a ZWJ (cc=0) in the sequence would block
> the canonical reordering of the sequence to hiriq before
> qamets. If that is the essence of the problem needing to
> be addressed, then this is a much simpler solution which would
> impact neither the stability of normalization nor require
> mass cloning of vowels in order to give them new combining
> classes.

Yes, it would accomplish all that; and it is a groanable kludge. At least with distinct vowel characters for Biblical Hebrew, we'd come to a point where we could forget about it, and wouldn't be wincing every time we considered it.

> The problem of combinations of vowels with meteg could be
> amenable to a similar approach. OR, one could propose just
> one additional meteq/silluq character, to make it possible
> to distinguish (in plain text) instances of left-side and
> right-side meteq placement, for example.

And the third position of meteg with hataf vowels? Introduce *two* additional meteg/silluq characters?

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
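Whatever its aesthetics, the mechanism does work: canonical reordering only sorts within an unbroken run of marks with non-zero combining class, so any ccc = 0 character splits the run. A minimal Python sketch (mine, for illustration; CGJ, U+034F, would behave identically):

```python
import unicodedata

LAMED  = "\u05DC"  # HEBREW LETTER LAMED
QAMATS = "\u05B8"  # HEBREW POINT QAMATS, ccc = 18
HIRIQ  = "\u05B4"  # HEBREW POINT HIRIQ, ccc = 14
ZWJ    = "\u200D"  # ZERO WIDTH JOINER, ccc = 0

# Without the joiner, normalization swaps the marks (14 sorts before 18):
plain = LAMED + QAMATS + HIRIQ
assert unicodedata.normalize("NFC", plain) == LAMED + HIRIQ + QAMATS

# With ZWJ in between, the reorderable run is split and the
# typed order survives normalization unchanged:
blocked = LAMED + QAMATS + ZWJ + HIRIQ
assert unicodedata.normalize("NFC", blocked) == blocked
```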
Re: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
Jony Rosenne wrote on 06/26/2003 12:16:22 AM:

> When, in the Bible, one sees two vowels on a given consonant, it isn't so.

That's silly. When one sees two vowels on a given consonant in the Bible, it *is* so: the two vowels are written there. It may not correspond to actual phonology, i.e. what is spoken, but as has been made clear on many occasions, Unicode is not encoding phonology, it is encoding text. And in relation to text, your statement is simply wrong.

> There is one vowel for the consonant one sees, and another vowel for an
> invisible consonant. The proper way to encode it is to use some code to
> represent the invisible consonant. Then the problem mentioned below does not
> arise.

The idea of an invisible consonant would amount to encoding a phonological entity. That is the kind of thing that was at one time approved for Khmer (invisible characters representing inherent vowels) but later turned into an albatross, and when I proposed the same thing (an invisible inherent vowel) for Syloti Nagri, it was made very clear to me that it would not go down well with UTC.

Also, the proposed solution of an invisible consonant would leave unresolved the problem of meteg-vowel ordering distinctions, while the alternate proposal of giving meteg and the vowels all a combining class of 230 solves both problems at once. Two ad hoc solutions (one for multi-vowel ordering, another for meteg-vowel ordering) must certainly be far less preferable than one motivated solution (characters with canonical combining classes that are appropriate for the writing behaviours they exhibit).

I invite people to review the discussions from the unicoRe list from last December, at which time everyone (including you, Jony) concluded that the solution I proposed in L2/03-195 was the best one to pursue.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
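To see why a single shared class solves both orderings at once, here is a toy model (mine, not from L2/03-195) of the canonical reordering step: a stable exchange sort that swaps adjacent marks only when the first has a strictly greater combining class and the second's class is non-zero. The "proposed" values below follow the message's description of giving meteg and the vowels all class 230.

```python
# Toy model of canonical reordering: a stable exchange sort that swaps
# adjacent marks only when the first's ccc is strictly greater than the
# second's, and the second's ccc is non-zero.
def canonical_reorder(marks, ccc):
    out = list(marks)
    changed = True
    while changed:
        changed = False
        for i in range(len(out) - 1):
            if ccc[out[i]] > ccc[out[i + 1]] > 0:
                out[i], out[i + 1] = out[i + 1], out[i]
                changed = True
    return out

current  = {"hiriq": 14, "patah": 17, "meteg": 22}     # today's classes
proposed = {"hiriq": 230, "patah": 230, "meteg": 230}  # one shared class

# Today, both the vowel-vowel and the meteg-vowel orders are disturbed:
assert canonical_reorder(["patah", "hiriq"], current) == ["hiriq", "patah"]
assert canonical_reorder(["meteg", "patah"], current) == ["patah", "meteg"]

# With one shared class the swap condition never fires, so both
# encoded orders are preserved, and they stay distinctive:
assert canonical_reorder(["patah", "hiriq"], proposed) == ["patah", "hiriq"]
assert canonical_reorder(["meteg", "patah"], proposed) == ["meteg", "patah"]
```

Because the sort is stable, equal classes are never exchanged, which is the whole point of the proposal.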
Re: Major Defect in Combining Classes of Tibetan Vowels
Michael Everson wrote on 06/25/2003 04:36:20 PM:

[ re Biblical Hebrew ]

> Write it up with glyphs and minimal pairs and people will see the
> problem, if any. Or propose some solution. (That isn't "add duplicate
> characters".)

The only solution that UTC is willing to consider I have already submitted in a proposal (L2/03-195).

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
Re: Major Defect in Combining Classes of Tibetan Vowels
John Cowan wrote on 06/25/2003 03:15:21 PM:

> I don't understand how the current implementation "breaks BH text".
> At worst, normalization may put various combining marks in a non-traditional
> order, but all alternative orders are canonically equivalent anyway, and
> no (ordinary) Unicode process should depend on any specific order.

No, John, there are distinctions in Biblical Hebrew related to ordering, but due to the canonical combining classes these distinctions are all neutralized under canonical ordering / normalization. The alternate orders are canonically equivalent, but should not have been so.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
Re: Major Defect in Combining Classes of Tibetan Vowels
Ken Whistler wrote on 06/25/2003 05:29:59 PM:

> > The point is that hiriq before patah is *not*
> > canonically equivalent to patah before hiriq,
>
> This is true.
>
> > except in the erroneous
> > assumption of the Unicode Standard: the order of vowels makes words sound
> > different and mean different things.
>
> This is not.

Ken, I think you're reading John differently than he intended: the Unicode character sequences <hiriq, patah> and <patah, hiriq> *are* canonically equivalent, but the requirement for Biblical Hebrew is that alternate visual orders correspond to different vocalizations. The visual ordering of these marks therefore matters semantically, and the encoded orders should *not* have been made canonically equivalent.

> The current situation is not optimal for implementations, nor
> does canonically ordered text follow traditional preferences
> for spelling order -- that we can agree on. But I think the
> claims of inadequacy for the representation or rendering
> of Biblical Hebrew text are overblown.

The serious problem is that the writing distinctions that matter cannot currently be reliably represented, since they are not preserved under canonical ordering / normalization. This is all just a rehash of discussions we had on this list back in December, at which time it was acknowledged that this was the case, and that it was a problem.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
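Concretely (a Python sketch of my own, not from the thread): two encodings that differ only in vowel order become bit-identical under every normalization form, so once text has passed through any normalizing process, the author's distinction is unrecoverable.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED
PATAH = "\u05B7"  # HEBREW POINT PATAH, ccc = 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, ccc = 14

a = LAMED + PATAH + HIRIQ   # one intended reading order
b = LAMED + HIRIQ + PATAH   # the other

# Distinct before normalization, identical after it, in every form:
assert a != b
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```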