RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
--- "Unicode (public)" [EMAIL PROTECTED] wrote: Two of the most basic Unicode stability policies dictate that character assignments, once made, are never removed and character names can never change. Step 4 cannot happen; the best that can happen is that the code points in question can be deprecated. The renaming you suggest in 1 cannot happen either. [Gautam]: Well, too bad. I guess we still have an obligation to explore the extent ofsub-optimal solutions that are being imposed upon South-Asian scriptsfor the sake of *backward compatibility* or simply because they are "fait accomplis". (See Peter Kirk's posting on this issue).However, I am by no means suggesting that the fault lies with the Unicode Consortium. The change in the encoding model for the virama can't happen either; there are too many implementations based on it, and there are too many documents out there that use the current encoding model. Your suggestion wouldn't make them unreadable when opened with software that did things the way you're suggesting, but it would change their appearance in ways that are unlikely to be acceptable. [Gautam]: This is again the "fait accompli" argument. We need to *know* whether adopting an alternative model WOULD HAVE BEEN PREFERABLE, even if the option to do so is no longer available to us. The model I am proposing is precisely the one that has been in use for centuries in the Indian grammatical tradition (/ki/ = k+virama+i). I don't think there are too many South-Asian documents out there encoded in Unicode. At any rate converting them would be a rather simple matter of searching for combining forms of vowels and replacing them bythe [VIRAMA][VOWEL]sequence. The TDIL corpora are very small by current standards, and they require extensive reworking anyway. [I preface what follows with the observation that I'm not by any stretch of the imagination an expert on Indic scripts, but I do fancy myself an expert on Unicode.] I'm also pretty sure that using ZWJ as a viramawon't work and isn't intended to work. KA + ZWJ + KA means something totally different from KA + VIRAMA + KA, and I, for one, wouldn't expect them to be drawn the same. U+0915 represents the letter KA with its inherent vowel sound; that is, it represents the whole syllable KA. Two instances of U+0915 in a row would thus represent "KAKA", completely irrespective of how they're drawn. Introducing a ZWJ in the middle would allow the two SYLLABLES to ligate, but there's no ligature that represents "KAKA", so you should get the same appearance as you do without the ZWJ. The virama, on the other hand, cancels the vowel sound on the KA, tu! rning it into K: The sequence KA + VIRAMA + KA represents the syllable KKA, again irrespective of how it is drawn. In other words, ZWJ is intended to change the APPEARANCE of a piece of text without changing its MEANING (there are exceptions in the Arabic script, but this is the general rule). Having KA + ZWJ + KA render as the syllable KKA would break this rule: the ZWJwould be changing the MEANING of the text. Whether the syllable KKA gets drawn with a virama, a half-form, or a ligature is the proper province of ZWJ and ZWNJ, andthis is what they're documented in TUS to do. But ZWJ can't (and shouldn't) be used to turn KAKA into KKA. [Gautam]: I think there is a slight misunderstanding here. The ZWJ I am proposingisscript-specific (each script would have its own), call it "ZWJ PRIME" or even "JWZ" (in order to avoid confusion with ZWJ). It doesn't exist yet and hence has no semantics. 
JWZ is a piece of formalism. Its meaning would be precisely what we choose to assign to it. It behaves like the existing (script-specific) VIRAMAs except that it also occurs between a consonant and an independent vowel, forcing the latter to show up in its combining form. In this respect, it is in fact *closer* or *more faithful* to the classical VIRAMA model. Call it VIRAMA if you will. The only reason why I don't wish to call it "VIRAMA" is because I plan to use it after a vowel as well, as in: A + JWZ + Y + JWZ + AA encoding A+YOPHOLA+AA. If YOPHOLA is assigned an independent code point then this move would be unnecessary and my JWZ would just be the usual VIRAMA with an extended function that would, in fact, make it more compliant with the classical VIRAMA model.

Now that we have freed up all those code points occupied by the combining forms of vowels by introducing the VIRAMA with extended function, let us introduce an explicit (always visible) VIRAMA. That's all.

Maybe it was unfortunate to call U+094D a "virama," since it doesn't necessarily get drawn as a virama (or, indeed, as anything), but it's too late to revisit that decision.

No, the decision is not unfortunate because of that, but rather because U+094D doesn't behave like a virama in all respects, and hence my proposal for extension of its functions. For that matter, it may have been a mistake to use the virama model to encode
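To make the contrast above concrete, here is a minimal Java sketch (an added illustration, not part of the original messages) writing the two Devanagari sequences out as code point strings; the class name and printed labels are invented for the example.

// Added sketch: the two Devanagari sequences contrasted above.
public class ViramaVsZwj {
    public static void main(String[] args) {
        // KA + VIRAMA + KA: the virama cancels the inherent vowel,
        // so the sequence MEANS the single syllable KKA.
        String kka = "\u0915\u094D\u0915";

        // KA + ZWJ + KA: still MEANS the two syllables KA KA;
        // the ZWJ may affect only how it is drawn, not what it means.
        String kaka = "\u0915\u200D\u0915";

        System.out.println("KA+VIRAMA+KA -> " + kka + "  (syllable KKA)");
        System.out.println("KA+ZWJ+KA    -> " + kaka + "  (syllables KA KA)");
    }
}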
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Gautam Sengupta wrote: --- Marco Cimarosti wrote: OK but, then, your ZWJ becomes exactly what Unicode's VIRAMA has always been: [...] You are absolutely right. I am suggesting that the language-specific viramas be retained as script-specific *explicit* viramas that never disappear. In addition, let's have a script-specific ZWJ which behaves in the way you describe in the preceding paragraph.

Good, good. We are making small steps forward. What you are really asking for is that each Indic script has *two* viramas:

- a soft virama, which is normally invisible and only displays visibly in special cases (no ligatures for that cluster);
- a hard virama (or explicit virama, as you correctly called it), which always displays as such and never ligates with adjacent characters.

Let's assume that it would be handy to assign these two viramas to different keys on the keyboard. Or, even better, let's assign the soft virama to the plain key and the hard virama to the SHIFT key, OK? To avoid misunderstandings with the term virama, let's label this key JOINER. Now, this is what you *already* have in Unicode! On our hypothetical Bangla keyboard:

- the soft virama (the plain JOINER key) is Unicode's BENGALI SIGN VIRAMA;
- the hard virama (the SHIFT+JOINER key) is Unicode's BENGALI SIGN VIRAMA+ZWNJ.

Not only does Unicode allow all of the above, but it also has a third kind of virama, which may or may not be useful in Bangla but is certainly useful in Devanagari and Gujarati:

- the half-consonant virama (let's assign it to the ALT+JOINER key on our hypothetical keyboard), which forces the preceding consonant to be displayed as a half consonant, if possible. This is Unicode's BENGALI SIGN VIRAMA+ZWJ.

Notice that, once you have these three viramas on your keyboard, you don't need to have keys for ZWJ and ZWNJ, as their only use, in Indic, is after a xxx SIGN VIRAMA. Apart from the fact that two of the three viramas are encoded as a *pair* of code points, how does the *current* Unicode model prevent you from implementing the clean theoretical model that you have in mind?

[...] - independent and dependent vowels were the same characters; [...] I agree with you on all of these issues. You have in fact summed up my critique of the ISCII/Unicode model.

OK. But are you sure that this critique should necessarily be directed at the *encoding* model, rather than at some other part of the chain? I'll now try to demonstrate how the redundancy of dependent/independent vowels may also be solved at the *keyboard* level.

You are certainly aware that some national keyboards have so-called dead keys. A dead key is a key which does not immediately send (a) character(s) to the application but waits for a second key; in European keyboards dead keys are used to type accented letters. E.g., let's see how accented letters are typed on the Spanish keyboard (which, BTW, is by far the best designed keyboard in Western Europe):

1. If you press the ´ key, nothing is sent to the application, but the keystroke is memorized by the keyboard driver.
2. If you now press one of the a, e, i, o, u or y keys, the characters á, é, í, ó, ú or ý are sent to the application.
3. If you press the space bar, the character ´ itself is sent to the application.
4. If you press any other key, e.g. m, the two characters ´ and m are sent to the application in this order.

Now, in the description above substitute:

- the ´ key with 0985 BENGALI LETTER A (but let's label it VIRTUAL CONSONANT);
- the a ... y keys with 09BE BENGALI VOWEL SIGN AA ... 09CC BENGALI VOWEL SIGN AU;
- the á ... ý characters with 0986 BENGALI LETTER AA ... 0994 BENGALI LETTER AU.

What you have is a Bangla keyboard where dependent vowels are typed with a single vowel keystroke, and independent vowels are typed with the sequence VIRTUAL CONSONANT+vowel.

Do you prefer your cons+VIRAMA+vowel model? Personally, I find it suboptimal, as it requires, on average, more keystrokes. However, if that's what you want, in the Spanish keyboard description above substitute:

- the ´ key with the unshifted JOINER (= virama) key that we have already defined above;
- the a ... y keys with 0986 BENGALI LETTER AA ... 0994 BENGALI LETTER AU;
- the á ... ý characters with 09BE BENGALI VOWEL SIGN AA ... 09CC BENGALI VOWEL SIGN AU.

Now you have a Bangla keyboard where independent vowels are typed with a single keystroke, and dependent vowels are typed with the sequence JOINER+vowel.

_ Marco
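Marco's three "virama keys" can be written out as Unicode sequences for a Bengali KKA cluster. The following is an added Java sketch, not part of his message; it assumes a renderer that follows the TUS rules for VIRAMA plus ZWJ/ZWNJ, and the key labels are the hypothetical ones from the mail.

// Added sketch: the three "virama keys" as Unicode sequences for a
// Bengali KKA cluster (KA = U+0995, VIRAMA = U+09CD).
public class ThreeViramas {
    static final String KA = "\u0995";
    static final String VIRAMA = "\u09CD";
    static final String ZWNJ = "\u200C";
    static final String ZWJ = "\u200D";

    public static void main(String[] args) {
        String soft = KA + VIRAMA + KA;        // may ligate into a conjunct
        String hard = KA + VIRAMA + ZWNJ + KA; // virama stays visible, no ligature
        String half = KA + VIRAMA + ZWJ + KA;  // first KA as a half-form, if the font has one

        System.out.println("soft (JOINER):       " + soft);
        System.out.println("hard (SHIFT+JOINER): " + hard);
        System.out.println("half (ALT+JOINER):   " + half);
    }
}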
Re: Bangla: [ZWJ], [VIRAMA] and CV sequences
Ken, I stand corrected. Long syllabic /r l/ as well as Assamese /r v/ are indeed additions beyond the ISCII code chart. My objection, however, was not against their inclusion but against their placement. I understand why long syllabic /r l/ could not be placed with the vowels, but why were Assamese /r v/ assigned U+09F0 and U+09F1 instead of U+09B1 and U+09B5 respectively?

--- Kenneth Whistler [EMAIL PROTECTED] wrote: In the case of the Assamese letters, these additions separate out the *distinct* forms for Assamese /r/ and /v/ from the Bangla forms, and *enable* correct sorting, rather than inhibiting it.

I fail to understand why Assamese /r v/ wouldn't be correctly sorted if placed in U+09F0 and U+09F1. Why do they need to be separated out from the Bangla forms in order to enable correct sorting?

The addition of the long syllabic /r/ and /l/ *enables* the representation of Sanskrit material in the Bengali script, and the code position in the charts is immaterial.

As stated earlier, my objection is not against their inclusion, but against their positioning on the code chart. Why is their relative position in the chart immaterial for sorting? If it is merely because there are script-specific sorting mechanisms already in place, then it's just a bad excuse for a sloppy job. I sincerely hope there is more to it than just that.

But be that as it may, they (TDIL) have nothing to do with the code point choices in the range U+09E0..U+09FF ...

If this is indeed the case, then I must say it's rather unfortunate. As a full corporate member representing the Republic of India, the Ministry of Information Technology should have had a BIG say in the matter. Were they ever consulted on the issue? Did they try to intervene suo motu? Will a Unicode official kindly let us know? Best, -Gautam.
Re: Euro Currency for UK
On 08/10/2003 16:52, Jain, Pankaj (MED, TCS) wrote: Hi, I have a requirement to display the Euro currency symbol for the en_GB locale. I know that if we use en_GB as the currency locale, then it defaults to Pound. Is there any way I can set it to Euro? Thanks Pankaj

Our default currency in the UK is still the pound sterling. It will take more than you changing some settings to change it to the Euro! :-)

The Euro symbol is available, and should be displayed correctly if you have a suitable font, in CP1252 and ISO-8859-1 which are the usual legacy encodings used in the UK - and of course in Unicode. I assume you are not using a system from before about 1998 when the Euro was added to systems and fonts. Anything beyond that depends on what system you are referring to, and so is probably not really a matter for this list.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Cursor movement in Hebrew, was: Non-ascii string processing?
On 08/10/2003 21:55, Jungshik Shin wrote: ... I've got a question about the cursor movement and selection in Hebrew text with such a grapheme (made up of 6 Unicode characters). What would be ordinary users' expectation when delete, backspace, and arrow keys (for cursor movement) are pressed around/in the middle of that DGC? Do they expect backspace/delete/arrow keys to operate _always_ at the DGC level, or do they sometimes want them to work at the Unicode character level (or its equivalent in their perception of Hebrew 'letters')? Exactly the same question can be asked of Indic scripts. I've asked this before (discussed the issue with Marco a couple of years ago), but I haven't heard back from native users of Indic scripts. Jungshik

I can't answer for native users of Hebrew. Maybe others can, but then most modern Hebrew word processing is done with unpointed text where this is not an issue. But I can speak for what has been done with Windows fonts for pointed Hebrew for scholarly purposes.

In each of them, as far as I can remember, delete and backspace delete only a single character, not a default grapheme cluster. This is probably appropriate for a font used mainly for scholarly purposes, where representations of complex grapheme clusters may need to be edited to make them exactly correct. A different approach might be more suitable for a font commonly used for entering long texts. In such a case I would tend to expect backspace to cancel one keystroke - but that may be ambiguous of course when editing text which has not just been entered.

Cursor movement also works at the character level. In some fonts there is no visible cursor movement when moving over a non-spacing character, which is probably the default but can be confusing to users. At least one font has attempted to place the cursor at different locations within the base character, e.g. in the middle when there are two characters in the DGC, at the 1/3 and 2/3 points when there are three characters. But this is likely to get confusing when there are 5 or 6 characters in the DGC and their order is not entirely predictable.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
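For readers who want to experiment, a rough Java sketch (an added illustration, not from the message above) of stepping through pointed Hebrew by default grapheme cluster rather than by code unit; java.text.BreakIterator's character instance only approximates DGC boundaries, so exact results can vary between JDK versions.

import java.text.BreakIterator;

// Added sketch: enumerate "user-perceived character" boundaries in a
// pointed Hebrew word, the units a DGC-level cursor would step over.
public class GraphemeCursor {
    public static void main(String[] args) {
        // Pointed Hebrew "shalom": bases interleaved with points
        // (SHIN, SHIN DOT, QAMATS, LAMED, VAV, HOLAM, FINAL MEM).
        String text = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD";

        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);

        int start = it.first();
        int end;
        while ((end = it.next()) != BreakIterator.DONE) {
            // Each [start, end) range is one cursor "step" at the cluster
            // level, even if it contains several Unicode characters.
            System.out.println("cluster " + start + ".." + end
                    + "  (" + (end - start) + " code units)");
            start = end;
        }
    }
}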
Re: Cursor movement in Hebrew, was: Non-ascii string processing?
Peter Kirk wrote: On 08/10/2003 21:55, Jungshik Shin wrote: ... I've got a question about the cursor movement and selection in Hebrew text with such a grapheme (made up of 6 Unicode characters). What would be ordinary users' expectation when delete, backspace, and arrow keys (for cursor movement) are pressed around/in the middle of that DGC? Do they expect backspace/delete/arrow keys to operate _always_ at the DGC level, or do they sometimes want them to work at the Unicode character level (or its equivalent in their perception of Hebrew 'letters')? Exactly the same question can be asked of Indic scripts. I've asked this before (discussed the issue with Marco a couple of years ago), but I haven't heard back from native users of Indic scripts. Jungshik

I can't answer for native users of Hebrew. Maybe others can, but then most modern Hebrew word processing is done with unpointed text where this is not an issue. But I can speak for what has been done with Windows fonts for pointed Hebrew for scholarly purposes. In each of them, as far as I can remember, delete and backspace delete only a single character, not a default grapheme cluster. This is probably appropriate for a font used mainly for scholarly purposes, where representations of complex grapheme clusters may need to be edited to make them exactly correct. A different approach might be more suitable for a font commonly used for entering long texts. In such a case I would tend to expect backspace to cancel one keystroke - but that may be ambiguous of course when editing text which has not just been entered. Cursor movement also works at the character level. In some fonts there is no visible cursor movement when moving over a non-spacing character, which is probably the default but can be confusing to users. At least one font has attempted to place the cursor at different locations within the base character, e.g. in the middle when there are two characters in the DGC, at the 1/3 and 2/3 points when there are three characters. But this is likely to get confusing when there are 5 or 6 characters in the DGC and their order is not entirely predictable.

I'm not a native speaker either, but I do have some occasion to work in both pointed and unpointed Hebrew, and I think I would disagree with Peter here. Certainly in the case of cursor movement, I'd expect the cursor to move by DGCs, and not take some unclear number of keypresses to move back a letter.

With backspace/delete, I would probably want that to work by characters within the current DGC, but once past that (or if I'm not doing it immediately after typing the characters) it should take out whole DGCs. They're just too messy and potentially randomly ordered for it to make any sense to try to edit them internally.

So I guess I see Hebrew DGCs as also going through a sort of commitment phase, when you type the next base character or use cursor-movement keys to move around: at that point, the DGC should go atomic and get deleted all at once, but so long as you're still typing combining characters (and occasional backspaces), backspace should go character by character (since you presumably can remember the last few you just typed).

Mind, I've not actually used all that many pointed-Hebrew text processors; this is more my idea of how things *should* work than how they *do* work. I think Yudit does or did something a bit like this, though. (Must have been did: at the moment it seems to be consistent about always doing everything by DGC.)

~mark
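A toy Java sketch of the "commitment phase" policy described above (an added illustration under simplifying assumptions: BMP-only text, the caller tracks whether the user is still composing, and a cluster is approximated as a base character plus trailing non-spacing marks).

// Added sketch of a two-mode backspace: character by character while the
// user is still attaching marks to the last base, whole cluster once the
// cluster has been "committed".
public class BackspacePolicy {
    static void backspace(StringBuilder buf, boolean stillComposing) {
        if (buf.length() == 0) return;
        if (stillComposing) {
            buf.deleteCharAt(buf.length() - 1);   // remove just the last mark/letter
        } else {
            int start = buf.length() - 1;
            // walk back over non-spacing marks to the base of the cluster
            while (start > 0
                    && Character.getType(buf.charAt(start)) == Character.NON_SPACING_MARK) {
                start--;
            }
            buf.delete(start, buf.length());      // remove the whole cluster
        }
    }

    public static void main(String[] args) {
        String cluster = "\u05E9\u05C1\u05B8";    // SHIN + SHIN DOT + QAMATS

        StringBuilder composing = new StringBuilder(cluster);
        backspace(composing, true);               // drops only the QAMATS
        System.out.println(composing.length());   // 2

        StringBuilder committed = new StringBuilder(cluster);
        backspace(committed, false);              // drops the whole cluster
        System.out.println(committed.length());   // 0
    }
}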
Re: Euro Currency for UK
The Euro symbol is available, and should be displayed correctly if you have a suitable font, in CP1252 and ISO-8859-1 which are the usual legacy encodings used in the UK - and of course in Unicode.

The Euro symbol is not in ISO 8859-1; it is, however, in ISO 8859-15 and ISO 8859-16. It was added to CP1252 after the initial specification of CP1252, and hence some systems may not render it correctly (especially since the update may have seemed a pointless install to some outside of the jurisdictions in which the Euro is legal tender). I think the question, though, is how to get some particular locale system to use that symbol as the default currency character.
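To see the difference concretely, a small added Java illustration (it assumes the JRE ships the windows-1252 and ISO-8859-15 charsets, which typical desktop JREs do): encode U+20AC in each legacy encoding and inspect the bytes. ISO-8859-1 has no euro, so the encoder substitutes '?'.

import java.nio.charset.Charset;
import java.util.Arrays;

// Added sketch: where the euro sign U+20AC lands in a few legacy encodings.
// Expected: 0x80 in windows-1252, 0xA4 in ISO-8859-15, '?' (0x3F) in ISO-8859-1.
public class EuroBytes {
    public static void main(String[] args) {
        String euro = "\u20AC";
        for (String name : new String[] {"windows-1252", "ISO-8859-15", "ISO-8859-1"}) {
            byte[] bytes = euro.getBytes(Charset.forName(name));
            System.out.println(name + " -> " + Arrays.toString(bytes));
        }
    }
}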
Re: Cursor movement in Hebrew, was: Non-ascii string processing?
One issue with deleting a DGC non-atomically is that deleting only the base character can lead to all sorts of strange and problematic combining character sequences. At a minimum, deleting a base character should delete the entire DGC atomically. In Hebrew, I don't see any problem with deleting combining characters non-atomically (although one might want to limit this to just off the logical end of the sequence out of user interface considerations). I suppose that this might be more of an issue in some other languages, though. One might be tempted to use some sort of canonical ordering logic to keep the complexity down, but the combining classes for Hebrew are so problematic that this would be a lost cause. I have used software where the cursor moves non-atomically across a DGC in Hebrew and I find it extremely confusing. The only way to make sense of what's happening is to remember the exact sequence in which the combining characters were entered. If someone wants to support such movement anyway, I think that the cursor shape needs to change dramatically to indicate what's going on. This is something I've never seen done well (usually not at all). Subtle changes in cursor position are useless as a visual indication to the user of what's going on. One might even need to include some sort of glyph highlighting to make clear the state of the text entry system. Ted Ted Hopp, Ph.D. ZigZag, Inc. [EMAIL PROTECTED] +1-301-990-7453 newSLATE is your personal learning workspace ...on the web at http://www.newSLATE.com/
Re: Euro Currency for UK
[EMAIL PROTECTED] wrote: The Euro symbol is not in ISO 8859-1; it is, however, in ISO 8859-15 and ISO 8859-16. It was added to CP1252 after the initial specification of CP1252, and hence some systems may not render it correctly (especially since the update may have seemed a pointless install to some outside of the jurisdictions in which the Euro is legal tender).

Isn't Euro support added to all CP1252 versions of Windows 98 and later, and in Windows 95 if people manually visit some Microsoft web page and download an update for this? My copy of iconv for Linux supports the euro sign in CP1252, and all of my other CP1252-compatible programs (e.g. Mozilla) also seem to support it.

Stefan
Re: Euro Currency for UK
Hmm.. this isn't really a Unicode question. You might want to post this question over on the i18n programming list '[EMAIL PROTECTED]' or on the locales list at '[EMAIL PROTECTED]'. You don't say what your programming or operating environments are.

There are two possibilities here. If you want to use your existing software to display currencies as the Euro instead of pounds, you can generally either set the display settings (the Windows Regional Options control panel) for currency to look like the Euro, or you can set (on Unix systems) the LC_MONETARY locale variable to some locale that uses the Euro with English-like formatting. A few systems actually provide a specialized variant locale for [EMAIL PROTECTED] for this purpose. A few provide an [EMAIL PROTECTED], which won't be helpful to you because of differences in the separators used in the two locales. You can also compile your own locale tables on Unix. Read the man pages on locale.

If you are writing your own software, then it really isn't that hard. Some programming environments, such as Java, provide a separate Currency class with the ability to create specific display-time formats that take both the currency and the display locale into account. Others require you to create a formatter to convert the value into a string for display.

In fact, when working with currency it is important to associate which currency you mean with the value. You may experience problems if you create a data field for value and format it according to the machine's runtime locale. The runtime locale can imply a certain default currency, as you note, but default does not mean only. Consider:

<value>123.45</value>

Not right: en_GB: £123.45; en_US: $123.45; de_DE: 123,45 €; ja_JP: ¥123

Most commonly the ISO 4217 currency code is associated with a value to create a data structure that is specific:

<value> <amount>123.45</amount> <currency>EUR</currency> </value>

en_GB: €123.45; en_US: €123.45; de_DE: 123,45 €; ja_JP: €123.45

Getting the formatting right is a matter of accessing the formatting functions of your programming API correctly. Most programming environments provide a way to format a value using separate locale rules (for grouping and decimal separators) and currency. More information about what you're trying to do would help in recommending a solution.

Best Regards, Addison

-- Addison P. Phillips Director, Globalization Architecture webMethods, Inc. +1 408.962.5487 mailto:[EMAIL PROTECTED] --- Internationalization is an architecture. It is not a feature. Chair, W3C I18N WG Web Services Task Force http://www.w3.org/International/ws
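In Java terms, an added sketch of the approach described above, using java.util.Currency and NumberFormat.setCurrency (both available from JDK 1.4 onwards); the exact symbol printed depends on the JDK's locale data.

import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// Added sketch: format a value with UK number conventions but the euro
// as the explicitly attached currency, rather than the locale default.
public class EuroForUk {
    public static void main(String[] args) {
        NumberFormat fmt = NumberFormat.getCurrencyInstance(Locale.UK);
        fmt.setCurrency(Currency.getInstance("EUR"));
        System.out.println(fmt.format(123.45));  // euro-denominated, UK formatting
    }
}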
Re: Euro Currency for UK
Isn't Euro support added to all CP1252 versions of Windows 98 and later, and in Windows 95 if people manually visit some Microsoft web page and download an update for this? Yes (well, I'm not sure of the exact versions, but that's a minor matter). At this point most people who would have needed to update have done, but it's possible that users in countries that don't use the Euro haven't done so. Given that we are talking about the use of the symbol with a locale that is otherwise focused on people in Britain it's worth considering.
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Peter-- ... But backward compatibility is also good-- it means the solution was good enough in the first place that people are using it. Not sure about this one, in the Unicode context in general. I have been told of all sorts of things which cannot be done in the name of backward compatibility even when it is demonstrated that the original solution was completely broken and it seems that no one had ever used it - because it cannot be guaranteed that no one has tried to use it, and so there just might be some broken or kludged texts out there whose integrity has to be guaranteed. I'm not saying that is a bad policy, just that the existence of the policy is not grounds for self-congratulation that none of the old solutions are broken. Yeah, you're right. I presume you're talking here mostly about the combining classes of the Hebrew vowel points. That was a case where even though the Hebrew encoding was clearly broken (insofar as Biblical Hebrew was concerned, anyway), fixes for the problem were constrained because there was a need to maintain backward compatibility ACROSS THE WHOLE STANDARD for reasons unrelated to Biblical Hebrew. So yeah, here the need to preserve backward compatibility tells us Unicode in general was good enough for people to use it, even though they couldn't use it for Biblical Hebrew. So yeah, I overstated my case. --Rich Gillam Language Analysis Systems
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Gautam--

[Gautam]: Well, too bad. I guess we still have an obligation to explore the extent of sub-optimal solutions that are being imposed upon South-Asian scripts for the sake of *backward compatibility* or simply because they are "faits accomplis". (See Peter Kirk's posting on this issue.) However, I am by no means suggesting that the fault lies with the Unicode Consortium.

I'm a little confused by this statement. What would be the difference between sticking with a suboptimal solution because it's a fait accompli and sticking with it out of the need for backward compatibility? The need for backward compatibility exists because the suboptimal solution is a fait accompli. Or are you stating that backward compatibility is a specious argument because the encoding is so broken nobody's actually using it?

[Gautam]: This is again the "fait accompli" argument. We need to *know* whether adopting an alternative model WOULD HAVE BEEN PREFERABLE, even if the option to do so is no longer available to us.

I don't understand. If the option to go to an alternative model is not available, why is it important to know that the alternative model would have been preferable?

[Gautam]: I think there is a slight misunderstanding here. The ZWJ I am proposing is script-specific (each script would have its own); call it "ZWJ PRIME" or even "JWZ" (in order to avoid confusion with ZWJ). It doesn't exist yet and hence has no semantics.

Okay. Maybe I'm dense, but this wasn't clear to me from your other emails. You're not proposing that U+200D be used to join Indic consonants together; you're basically arguing for virama-like functionality that goes far enough beyond what the virama does that you're not comfortable calling it a virama anymore.

JWZ is a piece of formalism. Its meaning would be precisely what we choose to assign to it. It behaves like the existing (script-specific) VIRAMAs except that it also occurs between a consonant and an independent vowel, forcing the latter to show up in its combining form.

Aha! This is what I wasn't parsing out of your previous emails. It was there, but I somehow didn't grok it. To summarize: Tibetan deals with consonant clusters by encoding each of the consonants twice: one series of codes is to be used for the first consonant in a cluster, and the other series is to be used for the others. The Indian scripts don't do this; they use a single series of codes for the consonants and cause consonants to form clusters by adding a VIRAMA code between them. But the Indian scripts still have two series of VOWELS more or less analogous to the two series of consonants in Tibetan. When you want a non-joining vowel, you use one series, and when you want a joining vowel, you use the other.

You want to have one series of vowels and extend the virama model to combining vowels. Thus, you'd represent KI as KA + VIRAMA + I; KA + I would represent two syllables: KA-I. Since a real virama never does this, you're using a different term ("JWZ" in your most recent message) for the character that causes the joining to happen. You're not proposing any difference in how consonants are treated, other than having this new character serve the sticking-together function that the VIRAMA now serves and changing the existing VIRAMA to always display explicitly.

Now do I understand you? Sorry for my earlier misunderstandings.
Now that we have freed up all those code points occupied by the combining forms of vowels by introducing the VIRAMA with extended function, let us introduce an explicit (always visible) VIRAMA. That's all.

As far as Unicode is concerned, you can't "free up" any code points. Once a code point is assigned, it's always assigned. You can deprecate code points, but that doesn't free them up to be reused; it only (with luck) keeps people from continuing to use them.

It seems to me that a system could support the usage you want and the old usage at the same time. I could be wrong, but I'm guessing that KA + VIRAMA + I isn't a sequence that makes any sense with current implementations and isn't being used. It would be possible to extend the meaning of the current VIRAMA to turn the independent vowels into dependent vowels. Future use of the dependent-vowel code points could be discouraged in favor of VIRAMA plus the independent-vowel code points. Old documents would continue to work, but new documents could use the model you're after. (You get the explicit virama the same way you do now: VIRAMA + ZWNJ.) This solution would involve encoding no new characters and no removal of existing characters, but just a change in the semantics of the VIRAMA.

That said, I'm not sure this is a good idea. If what you're really concerned about is typing and editing of text, you can have that work the way you want without changing the underlying encoding model. It involves somewhat more complicated
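For concreteness, an added Java illustration (not from the original messages) of the two encodings of the syllable KI contrasted above, spelled out with Bengali code points; the "proposed" sequence has that reading only under Gautam's model, not under current implementations.

// Added sketch: current vs. proposed encodings of the Bengali syllable KI.
public class KiEncodings {
    public static void main(String[] args) {
        String kiCurrent  = "\u0995\u09BF";        // KA + dependent VOWEL SIGN I (today's model)
        String kiProposed = "\u0995\u09CD\u0987";  // KA + VIRAMA + independent LETTER I (proposed reading)
        String kaI        = "\u0995\u0987";        // KA + LETTER I = two syllables, KA-I

        System.out.println("current  KI : " + kiCurrent);
        System.out.println("proposed KI : " + kiProposed);
        System.out.println("KA-I        : " + kaI);
    }
}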
Re: Bangla: [ZWJ], [VIRAMA] and CV sequences
On 09/10/2003 08:44, Unicode (public) wrote: ... Yeah, you're right. I presume you're talking here mostly about the combining classes of the Hebrew vowel points. ... Mostly. I have come across other similar cases e.g. the Arabic hamza issue recently discussed on the bidi list, perhaps also the distinction between Greek tonos and acute. They are all cases where the stability policy forbids changes of combining class or deletion of a redundant character. ... That was a case where even though the Hebrew encoding was clearly broken (insofar as Biblical Hebrew was concerned, anyway), ... What is broken is the encoding of any sequence of vowels. Because of this no one had used Unicode for sequences of vowels. Except that someone may have tried, and although the resulting texts would be mixed up and invalid, apparently for backward compatibility that mixed-upness and invalidity has to be preserved. ... fixes for the problem were constrained because there was a need to maintain backward compatibility ACROSS THE WHOLE STANDARD for reasons unrelated to Biblical Hebrew. So yeah, here the need to preserve backward compatibility tells us Unicode in general was good enough for people to use it, ... Happily, yes! It would still have been good enough to use without those stability guarantees. It seems to me that some unwise promises were made which have caused the backward compatibility issue. I'm not convinced that those promises contributed much to the usability of Unicode; they may have made life a bit easier for some people, e.g. those who want to rely on data being normalised without the overhead of checking it, but made things a lot more difficult for some others. ... even though they couldn't use it for Biblical Hebrew. So yeah, I overstated my case. --Rich Gillam Language Analysis Systems -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Euro Currency for UK
- Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, October 09, 2003 5:20 PM Subject: Re: Euro Currency for UK Isn't Euro support added to all CP1252 versions of Windows 98 and later, and in Windows 95 if people manually visit some Microsoft web page and download an update for this? Yes (well, I'm not sure of the exact versions, but that's a minor matter). At this point most people who would have needed to update have done, but it's possible that users in countries that don't use the Euro haven't done so. Given that we are talking about the use of the symbol with a locale that is otherwise focused on people in Britain it's worth considering. The euro character was added to CP1252 back in 1999 and most systems have the character. However, the locales which should be using the euro were not updated and no replacement locales for Windows are directly available from Microsoft. They do have a tool available to add the euro as the default currency symbol to those locales which need it but that tool ONLY works if you have that locale as the default locale. This means that if I generate a new system (XP Professional) with all the latest updates but use UK as the standard locale and then try to switch to FRENCH/FRANCE I still get Francs! To get the locale to use euros I have to download this tool and run it while switched into the FRENCH/FRANCE locale! I'm not sure why you want to set the euro as the standard currency for UK as (at present) we have not switched to that currency!? Martin Green
Re: Euro Currency for UK
I think Addison is on the right track here. I would like to point to ICU sample code for this kind of thing: http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/samples/numfmt/main.cpp See the code there from setNumberFormatCurrency_2_6 on down (the preceding code is for older ICU versions and general number formatting API usage). ICU homepage: http://oss.software.ibm.com/icu/ Best regards, markus
Re: Bangla: [ZWJ], [VIRAMA] and CV sequences
Gautam asked: I stand corrected. Long syllabic /r l/ as well as Assamese /r v/ are indeed additions beyond the ISCII code chart. My objection, however, was not against their inclusion but against their placement. I understand why long syllabic /r l/ could not be placed with the vowels, but why were Assamese /r v/ assigned U+09F0 and U+09F1 instead of U+09B1 and U+09B5 respectively?

Because the 7th and 8th rows in each of these Indic scripts were where additions beyond the ISCII repertoire were added.

In the case of the Assamese letters, these additions separate out the *distinct* forms for Assamese /r/ and /v/ from the Bangla forms, and *enable* correct sorting, rather than inhibiting it. I fail to understand why Assamese /r v/ wouldn't be correctly sorted if placed in U+09F0 and U+09F1.

I presume you mean U+09B1 and U+09B5. The answer is that no Indic script is correctly sorted simply by using code point order, anyway. You need a more sophisticated algorithm. And since such an algorithm will have weight tables, it doesn't *matter* where a particular character is in the code chart. See: http://www.unicode.org/notes/tn1/ for a discussion of these issues.

Why do they need to be separated out from the Bangla forms in order to enable correct sorting?

So that a tailored sorting for Assamese can be based on Assamese letters, and a tailored sorting for Bangla can be based on Bangla letters.

The addition of the long syllabic /r/ and /l/ *enables* the representation of Sanskrit material in the Bengali script, and the code position in the charts is immaterial. As stated earlier, my objection is not against their inclusion, but against their positioning on the code chart. Why is their relative position in the chart immaterial for sorting?

See the above technical note. If it will help you visualize the answer in some way, here is an excerpt from the Default Unicode Collation Element Table for the Unicode Collation Algorithm (Version 4.0), showing the default weight assignments for the relevant portion of the Bengali script (primary weights, in sorted order):

09AA ; [.15C4.0020.0002.09AA] # BENGALI LETTER PA
09AB ; [.15C5.0020.0002.09AB] # BENGALI LETTER PHA
09AC ; [.15C6.0020.0002.09AC] # BENGALI LETTER BA
09AD ; [.15C7.0020.0002.09AD] # BENGALI LETTER BHA
09AE ; [.15C8.0020.0002.09AE] # BENGALI LETTER MA
09AF ; [.15C9.0020.0002.09AF] # BENGALI LETTER YA
09DF ; [.15C9.0020.0002.09AF][.0000.00FD.0002.09BC] # BENGALI LETTER YYA; QQCM
09B0 ; [.15CA.0020.0002.09B0] # BENGALI LETTER RA
09F0 ; [.15CB.0020.0002.09F0] # BENGALI LETTER RA WITH MIDDLE DIAGONAL <---
09B2 ; [.15CC.0020.0002.09B2] # BENGALI LETTER LA
09F1 ; [.15CD.0020.0002.09F1] # BENGALI LETTER RA WITH LOWER DIAGONAL <---
09B6 ; [.15CE.0020.0002.09B6] # BENGALI LETTER SHA
09B7 ; [.15CF.0020.0002.09B7] # BENGALI LETTER SSA
09B8 ; [.15D0.0020.0002.09B8] # BENGALI LETTER SA

As you can see, the two additional letters in question, in the default table, sort in exactly the order you are suggesting, and as I said, the position in the *code chart* doesn't matter.

If it is merely because there are script-specific sorting mechanisms already in place, then it's just a bad excuse for a sloppy job. I sincerely hope there is more to it than just that.

It truly does not matter. *No* script in the Unicode Standard is encoded completely in a collation order. *All* scripts must be handled via weight tables in order to produce desired sorting behavior.
That is true for Latin, Greek, Cyrillic, ..., as well as Devanagari, Bengali, Gujarati, ..., so this is nothing particularly different about the encoding of Bengali.

But be that as it may, they (TDIL) have nothing to do with the code point choices in the range U+09E0..U+09FF ...

If this is indeed the case, then I must say it's rather unfortunate. As a full corporate member representing the Republic of India, the Ministry of Information Technology should have had a BIG say in the matter. Were they ever consulted on the issue?

Of course, once they got involved. And they have been making suggestions ever since. But you need to recognize that the particular characters you are concerned about were standardized and published by ISO in 1993 (based, it is true, on charts published by Unicode even earlier, which in turn were based on the ISCII standard), well before the Government of India became a member of the Unicode Consortium.

--Ken

Did they try to intervene suo motu? Will a Unicode official kindly let us know? Best, -Gautam.
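A toy Java illustration (added here, not part of Ken's message) of why chart position is immaterial: a tiny primary-weight table copied from the excerpt above sorts U+09F0 and U+09F1 in among RA, LA and SHA even though their code points are much higher. Real implementations of course use the full UCA tables and several weight levels.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Added sketch: sorting by a primary-weight table vs. raw code point order.
public class WeightTableSort {
    static final Map<Character, Integer> PRIMARY = new HashMap<Character, Integer>();
    static {
        PRIMARY.put('\u09B0', 0x15CA); // RA
        PRIMARY.put('\u09F0', 0x15CB); // RA WITH MIDDLE DIAGONAL (Assamese r)
        PRIMARY.put('\u09B2', 0x15CC); // LA
        PRIMARY.put('\u09F1', 0x15CD); // RA WITH LOWER DIAGONAL (Assamese v)
        PRIMARY.put('\u09B6', 0x15CE); // SHA
    }

    public static void main(String[] args) {
        Character[] letters = {'\u09B6', '\u09F1', '\u09B0', '\u09F0', '\u09B2'};

        Character[] byCodePoint = letters.clone();
        Arrays.sort(byCodePoint);  // naive: numeric code point order

        Character[] byWeight = letters.clone();
        Arrays.sort(byWeight, (a, b) -> PRIMARY.get(a).compareTo(PRIMARY.get(b)));

        System.out.println("code point order: " + Arrays.toString(byCodePoint));
        System.out.println("weight order:     " + Arrays.toString(byWeight));
    }
}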
Common XML Data Locale Repository V1.0 Alpha Available!
Forwarded on behalf of Helena Chapman: The OpenI18N WG of the Free Standards Group is pleased to inform you that the CLDR (Common XML Locale Data Repository) V1.0 Alpha snapshot is available. The CLDR repository provides application developers a consistent and uniform resource for managing the locale-sensitive data used for formatting, parsing, and analysis. It also includes comparison charts that demonstrate the locale data differences on various platforms. For details on the locale data comparison charts, please see http://oss.software.ibm.com/cvs/icu/~checkout~/locale/all_diff_xml/comparison_charts.html. The V1.0 alpha is available at http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/xml/ via CVS under the tag release-1-0-alpha. To report problems, comments or defects, please submit a bug report at http://www.openi18n.org/locale-bugs/public. The V1.0 Locale Data Markup Language specification on which the CLDR data is based can be found at http://www.openi18n.org/specs/ldml/. Thank you. Regards, Helena Shih Chapman Manager, SWG Customer Satisfaction, Quality and ISO9K2K Co-Chair of OpenI18N / Free Standards Group -- Vladimir Weinstein, IBM GCoC-Unicode/ICU San Jose, CA [EMAIL PROTECTED]
Public Review Issue #23
Looking over the Public Review Issues... trying to scramble up the learning curve and make sense of some of what it's talking about... Here's a comment. I think U+05C3 HEBREW PUNCTUATION SOF PASUQ should probably also be in Sentence_Terminal. I suppose it's true that there are Biblical verses that are not complete grammatical sentences, but that's true of a lot of what gets marked as sentences. It certainly would obey the Principle of Least Astonishment, for me, if I hit the move one sentence forward key and it jumped to the next verse. Comments? ~mark
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
--- "Unicode (public)" [EMAIL PROTECTED] wrote: Gautam-- ... I don't understand. If the option to go to an alternative model is notavailable, why is it important to know that the alternative model would have been preferable? [Gautam]: Just for the sake of knowing, I guess. "... ripeness is all". [Gautam]: I think there is a slightmisunderstanding here. TheZWJ I am proposing is script-specific (each scriptwould have its own),call it "ZWJ PRIME" or even "JWZ" (in order to avoidconfusion withZWJ). It doesn't exist yet and hence has no semantics. Okay. Maybe I'm dense, but this wasn't clear to mefrom your otheremails. [Gautam]: Heavens, no! It must be my non-native English that's creating all these communication gaps. You're not proposing that U+200D be used tojoin Indicconsonants together; you're basically arguing forvirama-likefunctionality that goes far enough beyond what the virama does thatyou're not comfortable calling it a virama anymore. [Gautam]: Indeed. You got it just right. Let us introduce the term "Ind VIRAMA" to refer to the virama used in Sanskrit and other Indic languages, and Uni VIRAMA" to refer to the virama in Unicode. The two are *not* identical. Uni VIRAMA lacks the full functionality ofInd Virama. I am proposing two extensions to Uni Virama: 1. extension of its functionality to allow cons+combining vowel to be encoded as ConsVIRAMAfull Vowel, and 2. extension of its functionality further to allow vowel+yophola to be encoded as VowelVIRAMAfull Y (1) merely confers on Uni VIRAMA the full functionality of Ind VIRAMA, making the two functionally identical. (2) is a hack, a crude ad hoc solution to the problem of how to encode Bangla vowel+yophola sequences. It is THIS latter extension that would make Uni VIRAMA un-VIRAMA-like, and hence my discomfiture with the name "VIRAMA". But (2) can be avoided if we can find some other solution to the YOPHOLA problem, such as assigning a code point to YOPHOLA in addition to the one already assigned to Y. And this (that is, addition of a distinct YOPHOLA on the code chart), by the way, would also disambiguate RY sequences in Bangla. (See Paul Nelson, "Bengali Script: Formation of the Reph and use of the ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER"). I now feel that it is better to avoid extension 2 for the sake of keeping the model clean. Let us say we find some other acceptable solution to the problems raised by combinations involving YOPHOLA. To summarize: Tibetan deals with consonant clusters by encodingeach of the consonantstwice: One series of codes is to be used for thefirst consonant in acluster, and the other series is to be used for the others. The Indianscripts don't do this; they use a single series ofcodes for theconsonants and cause consonants to form clusters byadding a VIRAMA codebetween them. But the Indian scripts still have twoseries of VOWELS more or less analogous to the two series ofconsonants in Tibetan. Whenyou want a non-joining vowel, you use one series,and when you want ajoining vowel, you use the other. [Gautam]: In UnicodeIndic CV and CC sequencesare treated differently. It uses the VIRAMA model for CC clusters, but the Tibetan model for CV's. I am suggesting the use of the VIRAMA model for BOTH. You want to have one series of vowels and extend thevirama model tocombining vowels. Thus, you'd represent KI as KA +VIRAMA + I; KA + Iwould represent two syllables: KA-I. [Gautam]: Yes. 
Since a real virama never does this, you're using a different term ("JWZ" in your most recent message) for the character that causes the joining to happen.

[Gautam]: No, the *real* Ind VIRAMA does exactly this. Hence with this extension only (that is, as long as extension 2 is not implemented) I feel no compulsion to rename VIRAMA.

You're not proposing any difference in how consonants are treated, other than having this new character serve the sticking-together function that the VIRAMA now serves and changing the existing VIRAMA to always display explicitly. Now do I understand you? Sorry for my earlier misunderstandings.

[Gautam]: Yes, but note the clarifications provided in the preceding paragraphs.

Now that we have freed up all those code points occupied by the combining forms of vowels by introducing the VIRAMA with extended function, let us introduce an explicit (always visible) VIRAMA. That's all.

As far as Unicode is concerned, you can't "free up" any code points. Once a code point is assigned, it's always assigned. You can deprecate code points, but that doesn't free them up to be reused; it only (with luck) keeps people from continuing to use them.

[Gautam]: This is just too bad.

It seems to me that a system could support the usage you want and the old usage at the same time. I could be wrong, but I'm guessing that KA + VIRAMA + I isn't a sequence that makes any sense with current implementations and isn't being used. It would be possible to extend the meaning of the current VIRAMA to turn the
Re: Public Review Issue #23
On Thursday, October 09, 2003 11:19 PM, Mark E. Shoulson wrote: Looking over the Public Review Issues... trying to scramble up the learning curve and make sense of some of what it's talking about... Here's a comment. I think U+05C3 HEBREW PUNCTUATION SOF PASUQ should probably also be in Sentence_Terminal. I suppose it's true that there are Biblical verses that are not complete grammatical sentences, but that's true of a lot of what gets marked as sentences. It certainly would obey the Principle of Least Astonishment, for me, if I hit the move one sentence forward key and it jumped to the next verse. Comments? I agree. Ted Ted Hopp, Ph.D. ZigZag, Inc. [EMAIL PROTECTED] +1-301-990-7453 newSLATE is your personal learning workspace ...on the web at http://www.newSLATE.com/