RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

2003-10-09 Thread Gautam Sengupta



--- "Unicode (public)" [EMAIL PROTECTED] wrote:  Two of the most basic Unicode stability policies dictate that character  assignments, once made, are never removed and character names can never  change. Step 4 cannot happen; the best that can happen is that the code  points in question can be deprecated. The renaming you suggest in 1  cannot happen either. 

[Gautam]: Well, too bad. I guess we still have an obligation to explore the extent of sub-optimal solutions that are being imposed upon South-Asian scripts for the sake of *backward compatibility* or simply because they are "faits accomplis". (See Peter Kirk's posting on this issue.) However, I am by no means suggesting that the fault lies with the Unicode Consortium.
 The change in the encoding model for the virama can't happen either;  there are too many implementations based on it, and there are too many  documents out there that use the current encoding model. Your  suggestion wouldn't make them unreadable when opened with software that  did things the way you're suggesting, but it would change their  appearance in ways that are unlikely to be acceptable. 

[Gautam]: This is again the "fait accompli" argument. We need to *know* whether adopting an alternative model WOULD HAVE BEEN PREFERABLE, even if the option to do so is no longer available to us. The model I am proposing is precisely the one that has been in use for centuries in the Indian grammatical tradition (/ki/ = k+virama+i). I don't think there are too many South-Asian documents out there encoded in Unicode. At any rate converting them would be a rather simple matter of searching for combining forms of vowels and replacing them by the [VIRAMA][VOWEL] sequence. The TDIL corpora are very small by current standards, and they require extensive reworking anyway.
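The search-and-replace conversion described above could be sketched as a one-pass transliteration; the vowel table below is deliberately partial and purely illustrative (a hypothetical helper, not any actual TDIL tooling):

```python
# Sketch of the conversion Gautam describes: replace each combining
# (dependent) vowel sign with VIRAMA + the independent vowel letter.
# Only a few Bengali vowels are mapped here, for illustration.

VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA

# dependent vowel sign -> independent vowel letter
DEP_TO_INDEP = {
    "\u09BE": "\u0986",  # SIGN AA -> LETTER AA
    "\u09BF": "\u0987",  # SIGN I  -> LETTER I
    "\u09C0": "\u0988",  # SIGN II -> LETTER II
    "\u09C1": "\u0989",  # SIGN U  -> LETTER U
}

def to_virama_model(text: str) -> str:
    """Rewrite dependent vowel signs as VIRAMA + independent vowel."""
    out = []
    for ch in text:
        if ch in DEP_TO_INDEP:
            out.append(VIRAMA)
            out.append(DEP_TO_INDEP[ch])
        else:
            out.append(ch)
    return "".join(out)

# /ki/ today: KA + VOWEL SIGN I; under the proposed model: KA + VIRAMA + I
ki_current = "\u0995\u09BF"
print(to_virama_model(ki_current) == "\u0995\u09CD\u0987")  # True
```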
 [I preface what follows with the observation that I'm not by any stretch of the imagination an expert on Indic scripts, but I do fancy myself an expert on Unicode.]

 I'm also pretty sure that using ZWJ as a virama won't work and isn't intended to work. KA + ZWJ + KA means something totally different from KA + VIRAMA + KA, and I, for one, wouldn't expect them to be drawn the same. U+0915 represents the letter KA with its inherent vowel sound; that is, it represents the whole syllable KA. Two instances of U+0915 in a row would thus represent "KAKA", completely irrespective of how they're drawn. Introducing a ZWJ in the middle would allow the two SYLLABLES to ligate, but there's no ligature that represents "KAKA", so you should get the same appearance as you do without the ZWJ. The virama, on the other hand, cancels the vowel sound on the KA, turning it into K: The sequence KA + VIRAMA + KA represents the syllable KKA, again irrespective of how it is drawn.

 In other words, ZWJ is intended to change the APPEARANCE of a piece of text without changing its MEANING (there are exceptions in the Arabic script, but this is the general rule). Having KA + ZWJ + KA render as the syllable KKA would break this rule: the ZWJ would be changing the MEANING of the text.

 Whether the syllable KKA gets drawn with a virama, a half-form, or a ligature is the proper province of ZWJ and ZWNJ, and this is what they're documented in TUS to do. But ZWJ can't (and shouldn't) be used to turn KAKA into KKA.
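The distinction drawn above is visible at the raw code point level; a minimal Python illustration (the variable names are mine, not anything from the thread):

```python
import unicodedata

KA, VIRAMA, ZWJ = "\u0915", "\u094D", "\u200D"  # Devanagari KA, virama, ZWJ

kaka = KA + ZWJ + KA     # two syllables; the ZWJ is only a ligation hint
kka  = KA + VIRAMA + KA  # one syllable KKA: the virama cancels the vowel

# The sequences are distinct at the character level, whatever the display
print([unicodedata.name(c) for c in kka])
# ['DEVANAGARI LETTER KA', 'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER KA']
```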

[Gautam]: I think there is a slight misunderstanding here. The ZWJ I am proposing is script-specific (each script would have its own); call it "ZWJ PRIME" or even "JWZ" (in order to avoid confusion with ZWJ). It doesn't exist yet and hence has no semantics. JWZ is a piece of formalism. Its meaning would be precisely what we choose to assign to it. It behaves like the existing (script-specific) VIRAMAs except that it also occurs between a consonant and an independent vowel, forcing the latter to show up in its combining form. In this respect, it is in fact *closer* or *more faithful* to the classical VIRAMA model. Call it VIRAMA if you will. The only reason why I don't wish to call it "VIRAMA" is that I plan to use it after a vowel as well, as in A+JWZ+Y+JWZ+AA encoding A+YOPHOLA+AA. If YOPHOLA is assigned an independent code point then this move would be unnecessary and my JWZ would just be the usual VIRAMA with an extended function that would, in fact, make it more compliant with the classical VIRAMA model.

Now that we have freed up all those code points occupied by the combining forms of vowels by introducing the VIRAMA with extended function, let us introduce an explicit (always visible) VIRAMA. That's all.
 Maybe it was unfortunate to call U+094D a "virama," since it doesn't  necessarily get drawn as a virama (or, indeed, as anything), but it's  too late to revisit that decision. 

No, the decision is not unfortunate because of that, but rather because U+094D doesn't behave like a virama in all respects, and hence my proposal for extension of its functions.

 For that matter, it may have been a mistake to use the virama model to encode 
 

RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

2003-10-09 Thread Marco Cimarosti
Gautam Sengupta wrote:
 --- Marco Cimarosti wrote:
  OK but, then, your ZWJ becomes exactly what
  Unicode's VIRAMA has always
  been: [...]
 
 You are absolutely right. I am suggesting that the
 language-specific viramas be retained as
 script-specific *explicit* viramas that never
 disappear. In addition, let's have a script-specific
 ZWJ which behaves in the way you describe in the
 preceding paragraph.


Good, good. We are making small steps forward.

What you are really asking for is that each Indic script have *two* viramas:


- a soft virama, which is normally invisible and only displays visibly in
special cases (no ligatures for that cluster);

- a hard virama (or explicit virama, as you correctly called it), which
always displays as such and never ligates with adjacent characters.


Let's assume that it would be handy to assign these two viramas to different
keys on the keyboard. Or, even better, let's assign the soft virama to the
plain key and the hard virama to the SHIFT key, OK? To avoid
misunderstandings with the term virama, let's label this key JOINER.

Now, this is what you *already* have in Unicode! On our hypothetical Bangla
keyboard:


- the soft virama (the plain JOINER key) is Unicode's BENGALI SIGN
VIRAMA;

- the hard virama (the SHIFT+JOINER key) is Unicode's BENGALI SIGN
VIRAMA+ZWNJ.


Not only does Unicode allow all of the above, but it also has a third kind of
virama, which may or may not be useful in Bangla but is certainly useful
in Devanagari and Gujarati:


- the half-consonant virama (let's assign it to the ALT+JOINER key on our
hypothetical keyboard), which forces the preceding consonant to be displayed
as a half consonant, if possible. This is Unicode's BENGALI SIGN
VIRAMA+ZWJ.


Notice that, once you have these three viramas on your keyboard, you don't
need to have keys for ZWJ and ZWNJ, as their only use, in Indic, is
after a xxx SIGN VIRAMA.
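The three keyboard-level viramas enumerated above map onto Unicode character sequences as follows; a minimal illustration using KA as the sample consonant (the variable names are mine):

```python
# The three "viramas" Marco describes, as Bengali code point sequences.
VIRAMA, ZWJ, ZWNJ = "\u09CD", "\u200D", "\u200C"
KA = "\u0995"  # BENGALI LETTER KA

soft = KA + VIRAMA + KA         # soft virama: may ligate into a conjunct
hard = KA + VIRAMA + ZWNJ + KA  # hard virama: visible, no ligature
half = KA + VIRAMA + ZWJ + KA   # half-form: request the half consonant

print(len(soft), len(hard), len(half))  # 3 4 4
```

Note that the three renderings differ only at the display level; all three sequences represent the same syllable KKA.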

Apart from the fact that two of the three viramas are encoded as a *pair* of
code points, how does the *current* Unicode model prevent you from
implementing the clean theoretical model that you have in mind?


 [...] 
  - independent and dependent vowels were the same
  characters;
 [...] 
 
 I agree with you on all of these issues. You have in
 fact summed up my critique of the ISCII/Unicode model.


OK. But are you sure that this critique should necessarily be aimed at the
*encoding* model, rather than at some other part of the chain? I'll now try
to demonstrate how the redundancy of dependent/independent vowels may also
be solved at the *keyboard* level.

You are certainly aware that some national keyboards have so-called dead
keys. A dead key is a key which does not immediately send a character to
the application but waits for a second key; in European keyboards dead
keys are used to type accented letters. E.g., let's see how accented
letters are typed on the Spanish keyboard (which, BTW, is by far the best
designed keyboard in Western Europe):


1. If you press the ´ key, nothing is sent to the application, but the
keystroke is memorized by the keyboard driver.

2. If you now press one of the a, e, i, o, u or y keys, the corresponding
accented character (á, é, í, ó, ú or ý) is sent to the application.

3. If you press the space bar, character ´ itself is sent to the
application;

4. If you press any other key, e.g. m, the two characters ´ and m are
sent to the application in this order.


Now, in the description above substitute:


- the ´ key with 0985 BENGALI LETTER A (but let's label it VIRTUAL
CONSONANT);

- the a ... y keys with 09BE BENGALI VOWEL SIGN AA ... 09CC BENGALI
VOWEL SIGN AU;

- the á ... ý characters with 0986 BENGALI LETTER AA ... 0994 BENGALI
LETTER AU.


What you have is a Bangla keyboard where dependent vowels are typed with a
single vowel keystroke, and independent vowels are typed with the sequence
VIRTUAL CONSONANT+vowel.

Do you prefer your cons+VIRAMA+vowel model? Personally, I find it
suboptimal, as it requires, on average, more keystrokes. However, if that's
what you want, in the Spanish keyboard description above substitute:


- the ´ key with the unshifted JOINER (= virama) key that we have
already defined above;

- the a ... y keys with 0986 BENGALI LETTER AA ... 0994 BENGALI LETTER
AU;

- the á ... ý characters with 09BE BENGALI VOWEL SIGN AA ... 09CC
BENGALI VOWEL SIGN AU.


Now you have a Bangla keyboard where independent vowels are typed with a
single keystroke, and dependent vowels are typed with the sequence
JOINER+vowel.
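The dead-key behaviour described above could be sketched as a tiny driver state machine. The token name "JOINER" and the two-entry vowel table are illustrative assumptions, not a real keyboard layout:

```python
# Sketch of Marco's JOINER-as-dead-key idea for Bangla: JOINER followed by
# an independent vowel yields the dependent (combining) form; JOINER in any
# other position acts as an ordinary virama.

INDEP_TO_DEP = {
    "\u0986": "\u09BE",  # BENGALI LETTER AA -> VOWEL SIGN AA
    "\u0987": "\u09BF",  # BENGALI LETTER I  -> VOWEL SIGN I
}
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA

def process_keys(keys):
    """keys: iterable of 1-char strings or the token 'JOINER'."""
    out, pending = [], False
    for k in keys:
        if k == "JOINER":
            pending = True          # dead key: wait for the next keystroke
            continue
        if pending:
            pending = False
            if k in INDEP_TO_DEP:
                out.append(INDEP_TO_DEP[k])  # JOINER+vowel -> dependent form
                continue
            out.append(VIRAMA)      # otherwise the JOINER was a plain virama
        out.append(k)
    if pending:
        out.append(VIRAMA)          # trailing JOINER emits the virama itself
    return "".join(out)

# KA, JOINER, LETTER I -> KA + VOWEL SIGN I, i.e. /ki/
print(process_keys(["\u0995", "JOINER", "\u0987"]) == "\u0995\u09BF")  # True
```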


_ Marco




Re: Bangla: [ZWJ], [VIRAMA] and CV sequences

2003-10-09 Thread Gautam Sengupta
Ken,

I stand corrected. Long syllabic /r l/ as well as
Assamese /r v/ are indeed additions beyond the ISCII
code chart. My objection, however, was not against
their inclusion but against their placement. I
understand why long syllabic /r l/ could not be placed
with the vowels, but why were Assamese /r v/ assigned
U+09F0 and U+09F1 instead of U+09B1 and U+09B5
respectively?

 --- Kenneth Whistler [EMAIL PROTECTED] wrote:

 In the case of the Assamese letters, these 
 additions separate out the *distinct* forms for 
 Assamese /r/ and /v/ from the Bangla forms, and 
 *enable* correct sorting, rather than inhibiting it.

I fail to understand why Assamese /r v/ wouldn't be
correctly sorted if placed in U+09B1 and U+09B5. Why
do they need to be separated out from the Bangla forms
in order to enable correct sorting?

 The addition of the long syllabic /r/ and /l/ 
 *enables* the representation of Sanskrit
 material in the Bengali script, and the code
 position in the charts is immaterial.

As stated earlier, my objection is not against their
inclusion, but against their positioning on the code
chart. Why is their relative position in the chart
immaterial for sorting? If it is merely because there
are script-specific sorting mechanisms already in
place, then it's just a bad excuse for a sloppy job. I
sincerely hope there is more to it than just that.

 But be that as it may, they (TDIL) have nothing to 
 do with the code point choices in the range 
 U+09E0..U+09FF ...

If this is indeed the case, then I must say it's
rather unfortunate. As a full corporate member
representing the Republic of India, the Ministry of
Information Technology should have had a BIG say in
the matter. Were they ever consulted on the issue? Did
they try to intervene suo motu? Will a Unicode
official kindly let us know? Best, -Gautam.


__
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com



Re: Euro Currency for UK

2003-10-09 Thread Peter Kirk
On 08/10/2003 16:52, Jain, Pankaj (MED, TCS) wrote:

Hi,
I have a requirement to display the Euro currency symbol for the en_GB
locale. I know that if we use en_GB as the currency locale, then it defaults
to the pound. Is there any way I can set it to the Euro?
 
Thanks
Pankaj
Our default currency in the UK is still the pound sterling. It will take 
more than you changing some settings to change it to the Euro! :-)

The Euro symbol is available, and should be displayed correctly if you 
have a suitable font, in CP1252 and ISO-8859-1 which are the usual 
legacy encodings used in the UK - and of course in Unicode. I assume you 
are not using a system from before about 1998 when the Euro was added to 
systems and fonts. Anything beyond that depends on what system you are 
referring to, and so is probably not really a matter for this list.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Cursor movement in Hebrew, was: Non-ascii string processing?

2003-10-09 Thread Peter Kirk
On 08/10/2003 21:55, Jungshik Shin wrote:

...

 I've got a question about the cursor movement and
selection in Hebrew text with such a grapheme (made up of 6 Unicode
characters). What would be ordinary users' expectation when delete,
backspace, and arrow keys (for cursor movement) are pressed around/in the
middle of that DGC? Do they expect backspace/delete/arrow keys to operate
_always_ at the DGC level or sometimes do they want them to work at the
Unicode character level (or its equivalent in their perception of Hebrew
'letters')? Exactly the same question can be asked of Indic scripts.
I've asked this before (discussed the issue with Marco a couple of years
ago), but I haven't heard back from native users of Indic scripts.
 Jungshik



 

I can't answer for native users of Hebrew. Maybe others can, but then 
most modern Hebrew word processing is done with unpointed text where 
this is not an issue. But I can speak for what has been done with 
Windows fonts for pointed Hebrew for scholarly purposes.

In each of them, as far as I can remember, delete and backspace delete 
only a single character, not a default grapheme cluster. This is 
probably appropriate for a font used mainly for scholarly purposes, 
where representations of complex grapheme clusters may need to be edited 
to make them exactly correct. A different approach might be more 
suitable for a font commonly used for entering long texts. In such a 
case I would tend to expect backspace to cancel one keystroke - but that 
may be ambiguous of course when editing text which has not just been 
entered.

Cursor movement also works at the character level. In some fonts there 
is no visible cursor movement when moving over a non-spacing character, 
which is probably the default but can be confusing to users. At least 
one font has attempted to place the cursor at different locations within 
the base character e.g. in the middle when there are two characters in 
the DGC, at the 1/3 and 2/3 points when there are three characters. But 
this is likely to get confusing when there are 5 or 6 characters in the 
DGC and their order is not entirely predictable.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Cursor movement in Hebrew, was: Non-ascii string processing?

2003-10-09 Thread Mark E. Shoulson
Peter Kirk wrote:

On 08/10/2003 21:55, Jungshik Shin wrote:

...

 I've got a question about the cursor movement and
selection in Hebrew text with such a grapheme (made up of 6 Unicode
characters). What would be ordinary users' expectation when delete,
backspace, and arrow keys (for cursor movement) are pressed around/in 
the
middle of that DGC? Do they expect backspace/delete/arrow keys to 
operate
_always_ at the DGC level or sometimes do they want them to work at the
Unicode character level (or its equivalent in their perception of Hebrew
'letters')? Exactly the same question can be asked of Indic scripts.
I've asked this before (discussed the issue with Marco a couple of years
ago), but I haven't heard back from native users of Indic scripts.

 Jungshik

I can't answer for native users of Hebrew. Maybe others can, but then 
most modern Hebrew word processing is done with unpointed text where 
this is not an issue. But I can speak for what has been done with 
Windows fonts for pointed Hebrew for scholarly purposes.

In each of them, as far as I can remember, delete and backspace delete 
only a single character, not a default grapheme cluster. This is 
probably appropriate for a font used mainly for scholarly purposes, 
where representations of complex grapheme clusters may need to be 
edited to make them exactly correct. A different approach might be 
more suitable for a font commonly used for entering long texts. In 
such a case I would tend to expect backspace to cancel one keystroke - 
but that may be ambiguous of course when editing text which has not 
just been entered.

Cursor movement also works at the character level. In some fonts there 
is no visible cursor movement when moving over a non-spacing 
character, which is probably the default but can be confusing to 
users. At least one font has attempted to place the cursor at 
different locations within the base character e.g. in the middle when 
there are two characters in the DGC, at the 1/3 and 2/3 points when 
there are three characters. But this is likely to get confusing when 
there are 5 or 6 characters in the DGC and their order is not entirely 
predictable. 
I'm not a native speaker either, but I do have some occasion to work in 
both pointed and unpointed Hebrew, and I think I would disagree with 
Peter here.  Certainly in the case of cursor movement, I'd expect the 
cursor to move by DGCs, and not take some unclear number of keypresses 
to move back a letter.  With backspace/delete, I would probably want 
that to work by characters within the current DGC, but once past that 
(or if I'm not doing it immediately after typing the characters) it 
should take out whole DGCs.  They're just too messy and potentially 
randomly ordered for it to make any sense to try to edit them 
internally.  So I guess I see Hebrew DGCs as also going through a sort 
of commitment phase, when you type the next base character or use 
cursor-movement keys to move around: at that point, the DGC should go 
atomic and get deleted all at once, but so long as you're still typing 
combining characters (and occasional backspaces), backspace should go 
character by character (since you presumably can remember the last few 
you just typed).

Mind, I've not actually used all that many pointed-Hebrew text 
processors; this is more my idea of how things *should* work than how 
they *do* work.  I think Yudit does or did something a bit like this, 
though.  (must have been did: at the moment it seems to be consistent 
about always doing everything by DGC).

~mark





Re: Euro Currency for UK

2003-10-09 Thread jon
 The Euro symbol is available, and should be displayed correctly if you 
 have a suitable font, in CP1252 and ISO-8859-1 which are the usual 
 legacy encodings used in the UK - and of course in Unicode.

The Euro symbol is not in ISO 8859-1; it is, however, in ISO 8859-15 and ISO 8859-16. It 
was added to CP1252 after the initial specification of CP1252 and hence some systems 
may not render it correctly (especially since the update may have seemed a pointless 
install to some outside of the jurisdictions in which the Euro is legal tender).

I think the question though is how to get some particular locale system to use that 
symbol as the default currency character.







Re: Cursor movement in Hebrew, was: Non-ascii string processing?

2003-10-09 Thread Ted Hopp
One issue with deleting a DGC non-atomically is that deleting only the base
character can lead to all sorts of strange and problematic combining
character sequences. At a minimum, deleting a base character should delete
the entire DGC atomically. In Hebrew, I don't see any problem with deleting
combining characters non-atomically (although one might want to limit this
to just off the logical end of the sequence, out of user interface
considerations). I suppose that this might be more of an issue in some other
languages, though.

One might be tempted to use some sort of canonical ordering logic to keep
the complexity down, but the combining classes for Hebrew are so problematic
that this would be a lost cause.

I have used software where the cursor moves non-atomically across a DGC in
Hebrew and I find it extremely confusing. The only way to make sense of
what's happening is to remember the exact sequence in which the combining
characters were entered. If someone wants to support such movement anyway, I
think that the cursor shape needs to change dramatically to indicate what's
going on. This is something I've never seen done well (usually not at all).
Subtle changes in cursor position are useless as a visual indication to the
user of what's going on. One might even need to include some sort of glyph
highlighting to make clear the state of the text entry system.
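The character-level versus cluster-level deletion being debated here can be sketched in a few lines. Grouping a base with its following combining marks is only a rough stand-in for real default grapheme cluster segmentation (UAX #29), but it is adequate for pointed Hebrew, where the marks are all non-spacing:

```python
import unicodedata

def clusters(text):
    """Split text into base + following combining marks -- a rough
    approximation of default grapheme clusters for pointed Hebrew."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch       # attach the mark to the current base
        else:
            out.append(ch)      # start a new cluster
    return out

def backspace_char(text):
    """Remove one code point (the scholarly-editing behaviour)."""
    return text[:-1]

def backspace_cluster(text):
    """Remove a whole DGC (the atomic behaviour)."""
    return "".join(clusters(text)[:-1])

# bet + dagesh + qamats: one cluster of three code points
word = "\u05D1\u05BC\u05B8"
print(backspace_char(word) == "\u05D1\u05BC")  # True: only qamats removed
print(backspace_cluster(word) == "")           # True: whole cluster removed
```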

Ted


Ted Hopp, Ph.D.
ZigZag, Inc.
[EMAIL PROTECTED]
+1-301-990-7453

newSLATE is your personal learning workspace
   ...on the web at http://www.newSLATE.com/





Re: Euro Currency for UK

2003-10-09 Thread Stefan Persson
[EMAIL PROTECTED] wrote:

The Euro symbol is not in ISO 8859-1; it is, however, in ISO 8859-15 and ISO 8859-16. It was added to CP1252 after the initial specification of CP1252 and hence some systems may not render it correctly (especially since the update may have seemed a pointless install to some outside of the jurisdictions in which the Euro is legal tender).
Isn't Euro support added to all CP1252 versions of Windows 98 and later, 
and in Windows 95 if people manually visit some Microsoft web page and 
download an update for this?  My copy of iconv for Linux supports € in 
CP1252, and all of my other CP1252-compatible programs (e.g. Mozilla) 
also seem to support it.

Stefan




Re: Euro Currency for UK

2003-10-09 Thread Addison Phillips [wM]
Hmm.. this isn't really a Unicode question. You might want to post this 
question over on the i18n programming list '[EMAIL PROTECTED]' 
or on the locales list at '[EMAIL PROTECTED]'.

You don't say what your programming or operating environments are. There 
are two possibilities here.

If you want to use your existing software to display currencies as the 
Euro instead of pounds, you can generally either set the display 
settings (Windows Regional options control panel) for currency to look 
like the Euro. Or you can set (on Unix systems) the LC_MONETARY locale 
variable to some locale that uses the Euro with English-like formatting. 
A few systems actually provide a specialized variant locale for 
[EMAIL PROTECTED] for this purpose. A few provide an [EMAIL PROTECTED], which won't be 
helpful to you because of differences in the separators used in the two 
locales.

You can also compile your own locale tables on Unix. Read the man pages 
on locale.

If you are writing your own software, then it really isn't that hard. 
Some programming environments, such as Java, provide a separate 
Currency class with the ability to create specific display-time formats 
that take both the currency and the display locale into account. Others 
require you to create a formatter to convert the value into a string for 
display.

In fact, when working with currency it is important to associate which 
currency you mean with the value. You may experience problems if you 
create a data field for value and format it according to the machine's 
runtime locale. The runtime locale can imply a certain default currency, 
as you note, but default does not mean only. Consider:

<value>123.45</value>

Not right:

en_GB: £123,45
en_US: $123.45
de_DE: 123,45 €
ja_JP: ¥123
Most commonly, the ISO 4217 currency code is associated with a value to 
create a data structure that is specific:

<value>
  <amount>123.45</amount>
  <currency>EUR</currency>
</value>
en_GB: €123,45
en_US: €123.45
de_DE: 123,45 €
ja_JP: €123.45
Getting the formatting right is a matter of accessing the formatting 
functions of your programming API correctly. Most programming 
environments provide a way to format a value using separate locale rules 
(for grouping and decimal separators) and currency.
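The separation of locale rules from currency described above can be sketched as follows. The two-entry locale table and the fixed symbol placement are simplifications of my own (real implementations would consult CLDR data, and e.g. German normally places the euro sign after the amount):

```python
# Sketch: carry the ISO 4217 code with the amount, and format with locale
# rules (separators) chosen independently of the currency.

LOCALE_RULES = {          # (decimal separator, grouping separator)
    "en_GB": (".", ","),
    "de_DE": (",", "."),
}
SYMBOLS = {"EUR": "\u20AC", "GBP": "\u00A3"}

def format_money(amount, currency, loc):
    dec, grp = LOCALE_RULES[loc]
    whole, frac = f"{amount:,.2f}".split(".")   # group with "," then remap
    whole = whole.replace(",", grp)
    # Symbol placement is simplified to a prefix for illustration.
    return f"{SYMBOLS[currency]}{whole}{dec}{frac}"

print(format_money(1234.5, "EUR", "en_GB"))  # €1,234.50
print(format_money(1234.5, "EUR", "de_DE"))  # €1.234,50
```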

More information about what you're trying to do would help in 
recommending a solution.

Best Regards,

Addison

--
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
+1 408.962.5487  mailto:[EMAIL PROTECTED]
---
Internationalization is an architecture. It is not a feature.
Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws




Re: Euro Currency for UK

2003-10-09 Thread jon
 Isn't Euro support added to all CP1252 versions of Windows 98 and later, 
 and in Windows 95 if people manually visit some Microsoft web page and 
 download an update for this?

Yes (well, I'm not sure of the exact versions, but that's a minor matter). At this 
point most people who would have needed to update have done so, but it's possible that 
users in countries that don't use the Euro haven't. Given that we are talking 
about the use of the symbol with a locale that is otherwise focused on people in 
Britain, it's worth considering.







RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

2003-10-09 Thread Unicode (public)
Peter--

... But backward compatibility is also good-- it
means the solution was good enough in the first place that people are 
using it.
  

Not sure about this one, in the Unicode context in general. I have been
told of all sorts of things which cannot be done in the name of backward
compatibility even when it is demonstrated that the original solution
was completely broken and it seems that no one had ever used it -
because it cannot be guaranteed that no one has tried to use it, and so
there just might be some broken or kludged texts out there whose
integrity has to be guaranteed. I'm not saying that is a bad policy,
just that the existence of the policy is not grounds for
self-congratulation that none of the old solutions are broken.

Yeah, you're right.

I presume you're talking here mostly about the combining classes of the
Hebrew vowel points.  That was a case where even though the Hebrew
encoding was clearly broken (insofar as Biblical Hebrew was concerned,
anyway), fixes for the problem were constrained because there was a need
to maintain backward compatibility ACROSS THE WHOLE STANDARD for reasons
unrelated to Biblical Hebrew.  So yeah, here the need to preserve
backward compatibility tells us Unicode in general was good enough for
people to use it, even though they couldn't use it for Biblical Hebrew.
So yeah, I overstated my case.

--Rich Gillam
  Language Analysis Systems



RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

2003-10-09 Thread Unicode (public)



Gautam--

  
  [Gautam]: Well, too bad. I guess we still have an obligation to explore
  the extent of sub-optimal solutions that are being imposed upon
  South-Asian scripts for the sake of *backward compatibility* or simply
  because they are "faits accomplis". (See Peter Kirk's posting on this
  issue.) However, I am by no means suggesting that the fault lies with
  the Unicode Consortium.
I'm a little confused by this statement. What would be the difference
between sticking with a suboptimal solution because it's a fait accompli
and sticking with it out of the need for backward compatibility? The need
for backward compatibility exists because the suboptimal solution is a
fait accompli. Or are you stating that backward compatibility is a
specious argument because the encoding is so broken nobody's actually
using it?

  
  [Gautam]: This is again the "fait accompli" argument. We need to *know* 
  whether adopting an alternative model WOULD HAVE BEEN PREFERABLE, even if the 
  option to do so is no longer available to us.
I 
don't understand. If the option to go to an alternative model is not 
available, why is it important to know that the alternative model would have 
been preferable?


  [Gautam]: I think there is a slight misunderstanding here. The ZWJ I am
  proposing is script-specific (each script would have its own); call
  it "ZWJ PRIME" or even "JWZ" (in order to avoid confusion with ZWJ). It
  doesn't exist yet and hence has no semantics.
Okay. Maybe I'm dense, but this wasn't clear to me from your other 
emails. You're not proposing that U+200D be used to join Indic consonants 
together; you're basically arguing for virama-like functionality that goes far 
enough beyond what the virama does that you're not comfortable calling it a 
virama anymore.

  JWZ is a piece of formalism.
  Its meaning would be precisely what we choose to assign to it. It behaves
  like the existing (script-specific) VIRAMAs except that it also occurs
  between a consonant and an independent vowel, forcing the latter to show
  up in its combining form.
Aha! This is what I wasn't parsing out of your previous 
emails. It was there, but I somehow didn't grok it. To 
summarize:

Tibetan deals with consonant clusters by encoding each of the consonants 
twice: One series of codes is to be used for the first consonant in a cluster, 
and the other series is to be used for the others. The Indian scripts 
don't do this; they use a single series of codes for the consonants and cause 
consonants to form clusters by adding a VIRAMA code between them. But the 
Indian scripts still have two series of VOWELS more or less analogous to the two 
series of consonants in Tibetan. When you want a non-joining vowel, you 
use one series, and when you want a joining vowel, you use the 
other.

You want to have one series of vowels and extend the virama model to 
combining vowels. Thus, you'd represent KI as KA + VIRAMA + I; KA + I 
would represent two syllables: KA-I. Since a real virama never does this, 
you're using a different term ("JWZ" in your most recent message) for the 
character that causes the joining to happen. You're not proposing any 
difference in how consonants are treated, other than having this new character 
server the sticking-together function that the VIRAMA now serves and changing 
the existing VIRAMA to always display explicitly.

Now do I understand you? Sorry for my earlier 
misunderstandings.

  Now that we have freed up all those code points occupied by the 
  combining forms of vowels by introducing the VIRAMA with extended function, 
  let us introduce an explicit (always visible) VIRAMA. That's all.
As far as Unicode is concerned, you can't "free up" any code 
points. Once a code point is assigned, it's always assigned. You can 
deprecate code points, but that doesn't free them up to be reused; it only (with 
luck) keeps people from continuing to use them.

It seems to me that a system could support the usage you want and the old 
usage at the same time. I could be wrong, but I'm guessing that KA + 
VIRAMA + I isn't a sequence that makes any sense with current implementations 
and isn't being used. It would be possible to extend the meaning of the 
current VIRAMA to turn the independent vowels into dependent vowels. 
Future use of the dependent-vowel code points could be discouraged in favor of 
VIRAMA plus the independent-vowel code points. Old documents would 
continue to work, but new documents could use the model you're after. (You 
get the explicit virama the same way you do now: VIRAMA + ZWNJ.) This 
solution would involve encoding no new characters and no removal of existing 
characters, but just a change in the semantics of the 
VIRAMA.

That said, I'm not sure this is a good idea. If what you're really 
concerned about is typing and editing of text, you can have that work the way 
you want without changing the underlying encoding model. It involves 
somewhat more complicated 

Re: Bangla: [ZWJ], [VIRAMA] and CV sequences

2003-10-09 Thread Peter Kirk
On 09/10/2003 08:44, Unicode (public) wrote:

...

Yeah, you're right.

I presume you're talking here mostly about the combining classes of the
Hebrew vowel points. ...
Mostly. I have come across other similar cases e.g. the Arabic hamza
issue recently discussed on the bidi list, perhaps also the distinction
between Greek tonos and acute. They are all cases where the stability
policy forbids changes of combining class or deletion of a redundant
character.
... That was a case where even though the Hebrew
encoding was clearly broken (insofar as Biblical Hebrew was concerned,
anyway), ...
What is broken is the encoding of any sequence of vowels. Because of
this no one had used Unicode for sequences of vowels. Except that
someone may have tried, and although the resulting texts would be mixed
up and invalid, apparently for backward compatibility that mixed-upness
and invalidity has to be preserved.
... fixes for the problem were constrained because there was a need
to maintain backward compatibility ACROSS THE WHOLE STANDARD for reasons
unrelated to Biblical Hebrew.  So yeah, here the need to preserve
backward compatibility tells us Unicode in general was good enough for
people to use it, ...
Happily, yes! It would still have been good enough to use without those
stability guarantees. It seems to me that some unwise promises were made
which have caused the backward compatibility issue. I'm not convinced
that those promises contributed much to the usability of Unicode; they
may have made life a bit easier for some people, e.g. those who want to
rely on data being normalised without the overhead of checking it, but
made things a lot more difficult for some others.
... even though they couldn't use it for Biblical Hebrew.
So yeah, I overstated my case.
--Rich Gillam
 Language Analysis Systems


 



--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/





Re: Euro Currency for UK

2003-10-09 Thread Martin JD Green

- Original Message - 
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, October 09, 2003 5:20 PM
Subject: Re: Euro Currency for UK


  Isn't Euro support added to all CP1252 versions of Windows 98 and later,
  and in Windows 95 if people manually visit some Microsoft web page and
  download an update for this?

 Yes (well, I'm not sure of the exact versions, but that's a minor matter).
At this point most people who would have needed to update have done, but
it's possible that users in countries that don't use the Euro haven't done
so. Given that we are talking about the use of the symbol with a locale that
is otherwise focused on people in Britain, it's worth considering.


The euro character was added to CP1252 back in 1999 and most systems have
the character. However, the locales which should be using the euro were not
updated and no replacement locales for Windows are directly available from
Microsoft. They do have a tool available to add the euro as the default
currency symbol to those locales which need it but that tool ONLY works if
you have that locale as the default locale.

This means that if I generate a new system (XP Professional) with all the
latest updates but use UK as the standard locale and then try to switch to
FRENCH/FRANCE I still get Francs! To get the locale to use euros I have to
download this tool and run it while switched into the FRENCH/FRANCE locale!
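Given that the locale data itself may still name the franc, one practical workaround is to format the number with the locale's conventions but supply the currency explicitly instead of taking the locale default. A minimal sketch; the helper name and hard-wired French-style separators are illustrative only, not any platform API:

```python
# Sketch: format a price with an explicitly chosen currency symbol rather than
# trusting the locale's default currency (which, as described above, may still
# be the franc on systems without the euro update). Helper is hypothetical.

def format_currency(amount: float, symbol: str = "\u20AC",
                    decimal_sep: str = ",", group_sep: str = " ") -> str:
    """Format with French-style separators and an explicit currency symbol."""
    whole, frac = divmod(round(amount * 100), 100)
    digits = f"{whole:,}".replace(",", group_sep)
    return f"{digits}{decimal_sep}{frac:02d} {symbol}"

assert format_currency(1234.5) == "1 234,50 \u20AC"
```

A real implementation would of course take the separators from the locale and only override the currency, which is essentially what the ICU sample Markus points to below does.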

I'm not sure why you want to set the euro as the standard currency for UK as
(at present) we have not switched to that currency!?

Martin Green





Re: Euro Currency for UK

2003-10-09 Thread Markus Scherer
I think Addison is on the right track here.

I would like to point to ICU sample code for this kind of thing: 
http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/samples/numfmt/main.cpp

See the code there from setNumberFormatCurrency_2_6 on down (the preceding code is for older ICU 
versions and general number formatting API usage).

ICU homepage: http://oss.software.ibm.com/icu/

Best regards,
markus



Re: Bangla: [ZWJ], [VIRAMA] and CV sequences

2003-10-09 Thread Kenneth Whistler
Gautam asked:

 I stand corrected. Long syllabic /r l/ as well as
 Assamese /r v/ are indeed additions beyond the ISCII
 code chart. My objection, however, was not against
 their inclusion but against their placement. I
 understand why long syllabic /r l/ could not be placed
 with the vowels, but why were Assamese /r v/ assigned
 U+09F0 and U+09F1 instead of U+09B1 and U+09B5
 respectively?

Because the 7th and 8th rows in each of these Indic
scripts was where additions beyond the ISCII repertoire
were added.

  In the case of the Assamese letters, these 
  additions separate out the *distinct* forms for 
  Assamese /r/ and /v/ from the Bangla forms, and 
  *enable* correct sorting, rather than inhibiting it.
 
 I fail to understand why Assamese /r v/ wouldn't be
 correctly sorted if placed in U+09F0 and U+09F1.

I presume you mean U+09B1 and U+09B5.

The answer is that no Indic script is correctly sorted
simply by using code point order, anyway. You need
a more sophisticated algorithm. And since such an
algorithm will have weight tables, it doesn't *matter*
where a particular character is in the code chart.

See:

http://www.unicode.org/notes/tn1/

for a discussion of these issues. 

 Why
 do they need to be separated out from the Bangla forms
 in order to enable correct sorting?

So that a tailored sorting for Assamese can be based
on Assamese letters, and a tailored sorting for Bangla
can be based on Bangla letters.

 
  The addition of the long syllabic /r/ and /l/ 
  *enables* the representation of Sanskrit
  material in the Bengali script, and the code
  position in the charts is immaterial.
 
 As stated earlier, my objection is not against their
 inclusion, but against their positioning on the code
 chart. Why is their relative position in the chart
 immaterial for sorting? 

See the above technical note. If it will help you visualize
the answer in some way, here is an excerpt from the
Default Unicode Collation Element Table for the
Unicode Collation Algorithm (Version 4.0), showing the
default weight assignments for the relevant portion of the
Bengali script:

09AA  ; [.15C4.0020.0002.09AA] # BENGALI LETTER PA
09AB  ; [.15C5.0020.0002.09AB] # BENGALI LETTER PHA
09AC  ; [.15C6.0020.0002.09AC] # BENGALI LETTER BA
09AD  ; [.15C7.0020.0002.09AD] # BENGALI LETTER BHA
09AE  ; [.15C8.0020.0002.09AE] # BENGALI LETTER MA
09AF  ; [.15C9.0020.0002.09AF] # BENGALI LETTER YA
09DF  ; [.15C9.0020.0002.09AF][.0000.00FD.0002.09BC] # BENGALI LETTER YYA; QQCM
09B0  ; [.15CA.0020.0002.09B0] # BENGALI LETTER RA
09F0  ; [.15CB.0020.0002.09F0] # BENGALI LETTER RA WITH MIDDLE DIAGONAL ---
09B2  ; [.15CC.0020.0002.09B2] # BENGALI LETTER LA
09F1  ; [.15CD.0020.0002.09F1] # BENGALI LETTER RA WITH LOWER DIAGONAL  ---
09B6  ; [.15CE.0020.0002.09B6] # BENGALI LETTER SHA
09B7  ; [.15CF.0020.0002.09B7] # BENGALI LETTER SSA
09B8  ; [.15D0.0020.0002.09B8] # BENGALI LETTER SA
  
(The first field of each collation element above is the primary weight; the
entries are listed in primary-weight order.)
As you can see, the two additional letters in question,
in the default table, sort in exactly the order you
are suggesting, and as I said, the position in the
*code chart* doesn't matter.
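Ken's point can be demonstrated directly with the primary weights from the excerpt above: sorting by weight, not by code point, puts the two Assamese letters exactly where the chart positions alone never could. A small Python sketch (weights copied from the table; this is not a full UCA implementation):

```python
# Sketch: sort by primary weights from the DUCET excerpt above, not by code
# point. Primary weights copied from the Version 4.0 table shown in the thread.

PRIMARY = {
    "\u09B0": 0x15CA,  # BENGALI LETTER RA
    "\u09F0": 0x15CB,  # RA WITH MIDDLE DIAGONAL (Assamese /r/)
    "\u09B2": 0x15CC,  # BENGALI LETTER LA
    "\u09F1": 0x15CD,  # RA WITH LOWER DIAGONAL (Assamese /v/)
    "\u09B6": 0x15CE,  # BENGALI LETTER SHA
}

letters = ["\u09B6", "\u09F1", "\u09B2", "\u09F0", "\u09B0"]

by_codepoint = sorted(letters)                        # raw chart order
by_weight = sorted(letters, key=PRIMARY.__getitem__)  # collation order

# Code point order pushes both Assamese additions after SHA...
assert by_codepoint == ["\u09B0", "\u09B2", "\u09B6", "\u09F0", "\u09F1"]
# ...but the weight table interleaves them where they belong:
# RA, RA-with-middle-diagonal, LA, RA-with-lower-diagonal, SHA.
assert by_weight == ["\u09B0", "\u09F0", "\u09B2", "\u09F1", "\u09B6"]
```

This is why the placement at U+09F0/U+09F1 rather than U+09B1/U+09B5 costs nothing for sorting.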

 If it is merely because there
 are script-specific sorting mechanisms already in
 place, then it's just a bad excuse for a sloppy job. I
 sincerely hope there is more to it than just that.

It truly does not matter. *No* script in the Unicode
Standard is encoded completely in a collation order.
*All* scripts must be handled via weight tables in
order to produce desired sorting behavior. That is
true for Latin, Greek, Cyrillic, ..., as well as Devanagari,
Bengali, Gujarati, ..., so this is nothing particularly
different about the encoding of Bengali.

 
  But be that as it may, they (TDIL) have nothing to 
  do with the code point choices in the range 
  U+09E0..U+09FF ...
 
 If this is indeed the case, then I must say it's
 rather unfortunate. As a full corporate member
 representing the Republic of India, the Ministry of
 Information Technology should have had a BIG say in
 the matter. Were they ever consulted on the issue? 

Of course, once they got involved. And they have been
making suggestions ever since. But you need to recognize
that the particular characters you are concerned about
were standardized and published by ISO in 1993 (based,
it is true, on charts published by Unicode even earlier,
which in turn were based on the ISCII standard),
well before the Government of India became a member of
the Unicode Consortium.

--Ken

 Did
 they try to intervene suo moto? Will a Unicode
 official kindly let us know? Best, -Gautam.




Common XML Data Locale Repository V1.0 Alpha Available!

2003-10-09 Thread Vladimir Weinstein
Forwarded on behalf of Helena Chapman:

The OpenI18N WG of the Free Standards Group is pleased to inform you that the CLDR 
(Common XML Locale Data Repository) V1.0 Alpha snapshot is available. The CLDR 
repository provides application developers with a consistent and uniform resource for 
managing the locale-sensitive data used for formatting, parsing, and analysis. 
It also includes comparison charts that demonstrate the locale data 
differences on various platforms.

For details on the locale data comparison charts, please see 
http://oss.software.ibm.com/cvs/icu/~checkout~/locale/all_diff_xml/comparison_charts.html. 

The V1.0 alpha is available at 
http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/xml/ via CVS under 
the tag release-1-0-alpha.
To report problems, comments or defects, please submit a bug report at 
http://www.openi18n.org/locale-bugs/public.
The V1.0 Locale Data Markup Language specification on which the CLDR data is 
based can be found at http://www.openi18n.org/specs/ldml/.

Thank you.

Regards,

Helena Shih Chapman
Manager, SWG Customer Satisfaction, Quality and ISO9K2K
Co-Chair of OpenI18N / Free Standards Group
--
Vladimir Weinstein, IBM GCoC-Unicode/ICU  San Jose, CA [EMAIL PROTECTED]



Public Review Issue #23

2003-10-09 Thread Mark E. Shoulson
Looking over the Public Review Issues... trying to scramble up the 
learning curve and make sense of some of what it's talking about... 
Here's a comment.

I think U+05C3 HEBREW PUNCTUATION SOF PASUQ should probably also be in 
Sentence_Terminal.  I suppose it's true that there are Biblical verses 
that are not complete grammatical sentences, but that's true of a lot of 
what gets marked as sentences.  It certainly would obey the Principle of 
Least Astonishment, for me, if I hit the move one sentence forward key 
and it jumped to the next verse.
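The behavior being asked for is easy to picture with a toy splitter whose terminator set includes sof pasuq. This is only an illustrative sketch of the desired outcome, not how UAX #29 sentence segmentation is actually specified:

```python
# Sketch: a toy sentence splitter whose terminator set includes U+05C3
# HEBREW PUNCTUATION SOF PASUQ, so that each Biblical verse counts as a
# sentence for "move one sentence forward"-style navigation.

SENTENCE_TERMINATORS = {".", "!", "?", "\u05C3"}  # sof pasuq added

def split_sentences(text: str) -> list[str]:
    """Split text after each terminator, keeping the terminator."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in SENTENCE_TERMINATORS:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

# Two short "verses", each ending in sof pasuq, come out as two sentences.
verses = "\u05D0\u05D1\u05C3 \u05D2\u05D3\u05C3"
assert len(split_sentences(verses)) == 2
```

If U+05C3 were left out of Sentence_Terminal, the whole string above would count as a single sentence, which is the astonishing behavior.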

Comments?

~mark




RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

2003-10-09 Thread Gautam Sengupta



--- "Unicode (public)" [EMAIL PROTECTED] wrote:  Gautam-- 
 ...
 I don't understand. If the option to go to an alternative model is not available, why is it
 important to know that the alternative model would have been preferable? 

[Gautam]: Just for the sake of knowing, I guess. "... ripeness is all".

 [Gautam]: I think there is a slight misunderstanding here. The ZWJ I am proposing is 
 script-specific (each script would have its own), call it "ZWJ PRIME" or even "JWZ" 
 (in order to avoid confusion with ZWJ). It doesn't exist yet and hence has no 
 semantics. 

 Okay. Maybe I'm dense, but this wasn't clear to me from your other emails. 

[Gautam]: Heavens, no! It must be my non-native English that's creating all these communication gaps.

 You're not proposing that U+200D be used to join Indic consonants together; you're
 basically arguing for virama-like functionality that goes far enough beyond what the 
 virama does that you're not comfortable calling it a virama anymore. 

[Gautam]: Indeed. You got it just right. Let us introduce the term "Ind VIRAMA" to refer to the virama used in Sanskrit and other Indic languages, and "Uni VIRAMA" to refer to the virama in Unicode. The two are *not* identical. Uni VIRAMA lacks the full functionality of Ind VIRAMA. I am proposing two extensions to Uni VIRAMA: 

1. extension of its functionality to allow cons + combining vowel to be encoded as Cons + VIRAMA + full Vowel, and 

2. extension of its functionality further to allow vowel + yophola to be encoded as Vowel + VIRAMA + full Y.

(1) merely confers on Uni VIRAMA the full functionality of Ind VIRAMA, making the two functionally identical. 

(2) is a hack, a crude ad hoc solution to the problem of how to encode Bangla vowel+yophola sequences. It is THIS latter extension that would make Uni VIRAMA un-VIRAMA-like, and hence my discomfiture with the name "VIRAMA". But (2) can be avoided if we can find some other solution to the YOPHOLA problem, such as assigning a code point to YOPHOLA in addition to the one already assigned to Y. And this (that is, addition of a distinct YOPHOLA on the code chart), by the way, would also disambiguate RY sequences in Bangla. (See Paul Nelson, "Bengali Script: Formation of the Reph and use of the ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER"). I now feel that it is better to avoid extension 2 for the sake of keeping the model clean. Let us say we find some other acceptable solution to the problems raised by combinations involving YOPHOLA.

 To summarize: Tibetan deals with consonant clusters by encoding each of the consonants 
 twice: One series of codes is to be used for the first consonant in a cluster, and the other 
 series is to be used for the others. The Indian scripts don't do this; they use a single 
 series of codes for the consonants and cause consonants to form clusters by adding a 
 VIRAMA code between them. But the Indian scripts still have two series of VOWELS 
 more or less analogous to the two series of consonants in Tibetan. When you want a 
 non-joining vowel, you use one series, and when you want a joining vowel, you use the 
 other.

[Gautam]: In Unicode, Indic CV and CC sequences are treated differently. It uses the VIRAMA model for CC clusters, but the Tibetan model for CVs. I am suggesting the use of the VIRAMA model for BOTH.

 You want to have one series of vowels and extend the virama model to combining 
 vowels. Thus, you'd represent KI as KA + VIRAMA + I; KA + I would represent two 
 syllables: KA-I. 

[Gautam]: Yes.

 Since a real virama never does this, you're using a different term ("JWZ" in your most 
 recent message) for the character that causes the joining to happen. 

[Gautam]: No, the *real* Ind VIRAMA does exactly this. Hence with this extension only (that is, as long as extension 2 is not implemented) I feel no compulsion to rename VIRAMA.
 You're not proposing any difference in how consonants are treated, other than having 
 this new character serve the sticking-together function that the VIRAMA now serves 
 and changing the existing VIRAMA to always display explicitly. Now do I understand 
 you? Sorry for my earlier misunderstandings.

[Gautam]: Yes, but note the clarifications provided in the preceding paragraphs.
 Now that we have freed up all those code points occupied by the combining forms of 
 vowels by introducing the VIRAMA with extended function, let us introduce an explicit 
 (always visible) VIRAMA. That's all. 

 As far as Unicode is concerned, you can't "free up" any code points. Once a code 
 point is assigned, it's always assigned. You can deprecate code points, but that 
 doesn't free them up to be reused; it only (with luck) keeps people from continuing to 
 use them. 

[Gautam]: This is just too bad.

 It seems to me that a system could support the usage you want and the old usage at 
 the same time. I could be wrong, but I'm guessing that KA + VIRAMA + I isn't a 
 sequence that makes any sense with current implementations and isn't being used. It 
 would be possible to extend the meaning of the current VIRAMA to turn the 

Re: Public Review Issue #23

2003-10-09 Thread Ted Hopp
On Thursday, October 09, 2003 11:19 PM, Mark E. Shoulson wrote:
 Looking over the Public Review Issues... trying to scramble up the 
 learning curve and make sense of some of what it's talking about... 
 Here's a comment.
 
 I think U+05C3 HEBREW PUNCTUATION SOF PASUQ should probably also be in 
 Sentence_Terminal.  I suppose it's true that there are Biblical verses 
 that are not complete grammatical sentences, but that's true of a lot of 
 what gets marked as sentences.  It certainly would obey the Principle of 
 Least Astonishment, for me, if I hit the move one sentence forward key 
 and it jumped to the next verse.
 
 Comments?

I agree.

Ted


Ted Hopp, Ph.D.
ZigZag, Inc.
[EMAIL PROTECTED]
+1-301-990-7453

newSLATE is your personal learning workspace
   ...on the web at http://www.newSLATE.com/