Re: When to validate?

2004-12-10 Thread Antoine Leca
Arcane Jill wrote:

> And yet, in an expression such as tolower(trim(s)), the second
> validation is unnecessary. The input to tolower() /must/ be valid,
> because it is the output of trim(). But on the other hand, tolower()
> could be called with arbitrary input, so I can't skip the validation.

What is a "string" ?

If you are using C as the language, this is a particularly valid question,
since in general the output can be larger than the input, so you will have
to resort to adding a maximum-length parameter, or to allowing realloc(),
which means the strings are required to be malloc'ed in the first place.

As a result, your strings are likely to be structures.
Then, it is pretty easy to add an s_valid flag, and you are done.
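
For what it is worth, a minimal sketch of the idea in C; the names
(ustring, s_valid) and the layout are only illustrative assumptions, not
any established API:

    #include <stddef.h>

    /* Sketch of a counted, malloc'ed string with a validity flag. */
    typedef struct {
        char  *data;    /* malloc'ed, so realloc() can grow it  */
        size_t len;     /* current length in bytes              */
        size_t cap;     /* allocated capacity                   */
        int    s_valid; /* nonzero once contents were validated */
    } ustring;

    /* Validate only when not already known-valid: functions like trim()
       would set s_valid on their output, so a second check costs
       nothing. */
    int ustring_ensure_valid(ustring *s,
                             int (*validate)(const char *, size_t))
    {
        if (!s->s_valid) {
            if (!validate(s->data, s->len))
                return 0;   /* invalid contents */
            s->s_valid = 1;
        }
        return 1;
    }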


Antoine



Re: ISO 10646 compliance and EU law

2004-12-27 Thread Antoine Leca
On Sunday, December 26th, 2004 5:54 a.m. (!)
Philippe Verdy wrote, among other things:
>
> In the EU legislation, there are tons of references to "languages",
> but much less about "script systems";

However, there is a well-known case about them. In 1997, during the
creation of the euro monetary unit, Greece made its Yes vote conditional on
the Greek writing system showing up on the notes, whatever the agreed name.
And so it does.

In a few years from now, when Bulgaria joins the EMU, we will see whether
they (the Bulgarians) choose  or  for the name of the currency. And in the
latter case, whether we should modify the notes to include one more
spelling (according to Unicode/10646,  is very different from ) even if
they look very similar...


Antoine




Re: ISO 10646 compliance and EU law

2004-12-27 Thread Antoine Leca
[ I am not subscribed to the hebrew list, so I do not post there; feel free
to relay this if it is worth it. I will not subscribe to that list just to
post it, and since Elaine did not explain on which list she wants the
discussion to take place, I chose the list I am subscribed to. ]

On Thursday, December 23rd, 2004 22:38Z E. Keown wrote:
>
> He thought that it is illegal under certain
> circumstances to sell non-ISO 10646-compliant software
> in the EU.

Hmmm, that would mean that the GSM SMS encoding (I believe it is called
ERMES), which blatantly violates the ISO/IEC 10646 framework (the same
codepoint may encode both a Greek and a Latin character, based on uppercase
lookalikes), would become illegal...
If that turns out to be true, it will be hot news to everybody including
her brother, at least here in Europe!

Also, conformance to 10646 (very different from compliance with Unicode)
requires stating the UTF, the implementation level and the collections
used. So it would mean that ANY software (that deals with strings, and to
which the "certain circumstances" might apply) sold here should make these
parameters clear.
Well, it looks like just about nobody is complying with the law :-).
Or else, the "certain circumstances" are rather narrow.


> That is, if I develop Hebrew/Castilian software which
> uses custom combining classes or other deviations from
> Unicode Hebrew, this software cannot be sold in Spain
> (in EU since 1986).

If you are developing this kind of software, here in Spain it will
undoubtedly be considered scholarly work (because of the Ladino etc.
heritage). And I doubt you can invoke a general-public regulation to apply
to this kind of work: first, politicians are not that dumb; second, almost
everybody will understand it is beneficial to Spain to allow this kind of
software to enter, even if it means looking the other way; and third,
scholars will not heed such limitations anyway: if they need the software,
they will use/buy it.

So, do not worry, you can continue to develop your software, and tailor it
to the Spanish "market", even if it does not comply with any and all of the
requisites of ISO/IEC 10646 (even if I fail to see the deviations that may
qualify as non-conformance, particularly since "The rules for forming the
combined graphic symbol are beyond the scope of ISO/IEC 10646."; just
specify you are using implementation level 3.)

( Of course, another matter could be that your software might directly
compete with one made here, and that the local software company may invoke
such a regulation to try to keep you out of their home market; just as you
can try similar tricks in Canada. But then we are speaking about disguised
trade barriers, which is a much wider subject. And it can apply anywhere
in the world. And there are laws here in the EU to beat this kind of
barrier. )


Antoine




Re: Code2000 on SourceForge (was Re: [indic] Re: Lack of Complex script rendering support on Android)

2012-02-03 Thread Antoine Leca
James Kass wrote:
> Of course I put these three Code2nnn fonts on SourceForge, being sick of 
> their further development and the whole commercial aura around them.
Thanks for your work contributing to Unicode and to the whole community.


Antoine



Re: Code2000 on SourceForge

2012-02-03 Thread Antoine Leca
Christoph Päper wrote:
> James Kass:
>
>> License already included in SourceForge download, namely GPLv3.
> You probably want to use GPL+FE, i.e. GPL with font exception.
> 
I am not completely sure you want to embed Code2000 in a document you
intend to distribute.


Antoine



Re: Devanagari Letter Short A

2004-02-18 Thread Antoine Leca
Philippe Verdy wrote:
> 
> U+0904 DEVANAGARI LETTER SHORT A is used only for the case of an
> independent vowel. It can be "viewed" as a conjunct of the
> independent vowel U+0905 DEVANAGARI LETTER A and the dependent
> vowel sign U+0946 DEVANAGARI VOWEL SIGN SHORT E (noted "for
> transcribing Dravidian vowels" in the Unicode charts).

You may regard it this way, but that is not so.
U+0905 followed by U+0946 is really U+090E. Compare with the other
scripts to understand why.

> I  don't know why this is not documented, because I can find various
> sources that use  or  which have exactly the
> same rendering and probably the same meaning and usage.

Wow! You have various sources that use a character added to Unicode only
about two and a half years ago! Impressive!

About the rendering of , since it violates the usual
rules, it is up to your system. Mine does not render it properly,
though (unless I cheat).

> I think that U+0946 was added in ISCII 1991 but was absent from ISCII
> 1988

No. It was there even in ISCII 83.

> (I think it's too late to define it: ISCII 1988 has been used 
> consistently before,

Hmmm... I have really no evidence that ISCII 1988 was ever used at all...
I would be happy to find some, though...


Antoine




Re: Devanagari Letter Short A

2004-02-18 Thread Antoine Leca
Ernest Cline wrote:
> 
> I've been trying to make sense of the Indian scripts, but am
> having one small difficulty.  I can't seem to find the ISCII 1991
> equivalent for U+0904 (DEVANAGARI LETTER SHORT A).

I do not believe you'll find it there.
U+0904 was added to Unicode for version 4.0, in 2001.
<http://www.unicode.org/consortium/utc-minutes/UTC-089-200111.html>
Search for 89-C19.


> Is this a character that is part of the set accessed by the
> extended code (xF0) or was this part of the ISCII 1988
> standard that did not survive the changes to ISCII 1991?

No and no.

 
> Alternatively, does ISCII encode this as xA4 + xE0 as this
> would seem to generate the proper glyph even tho it
> violates the syllable grammar given in Section 8 of ISCII?

It does not. At the very least, if you want to generate this
character in ISCII this way, try A4 DB E0 (using INV).
This is an ugly hack, of course.

As an aside, in some version of ISCII (EA-ISCII, notably),
A4 E0 is supposed to be equivalent to AD. This is the way
the alphabet is sometimes taught to children in India.

 
Antoine



Re: Filenames with non-Ascii characters

2004-02-24 Thread Antoine Leca
Kenneth Whistler wrote:
>
> Dipti Srivastava asked:
>
>> If I set my LC_TYPE to en_US.UTF8 do I need to convert the non-Ascii
>> characters like '\' in the filename for functions like open, etc.
>
> '\' *is* an ASCII character. 0x5C in ASCII to be exact. It is
> also 0x5C in UTF-8, so no (other) conversion is required.

Looks like the classic misunderstanding about different charsets (note that
I do not have the original headers).

I understand Dipti was really writing
>> If I set my LC_TYPE to en_US.UTF8 do I need to convert the non-Ascii
>> characters like '¥' in the filename for functions like open, etc.

which is what he can see on his display.

On the other hand, Ken saw (transformed into U+FF3C using presentation
forms to make it unambiguous):

>> If I set my LC_TYPE to en_US.UTF8 do I need to convert the non-Ascii
>> characters like '＼' in the filename for functions like open, etc.

... and reacted accordingly.


Hope it helps. And I hope everybody uses Unicode everywhere as soon as
possible, but I know this is somewhat vain!

Antoine




Re: What's in a wchar_t string on unix?

2004-03-02 Thread Antoine Leca
Rick Cameron asked:
> It seems that most flavours of unix define wchar_t to be 4 bytes.

As your "most" suggests, this is not universal. What if it is 8 bytes?  ;-)

> If the locale is set to be Unicode,

That part is highly suspect.
Since you write that, you already know that the wchar_t encoding (as well
as the char one) depends on the locale setting. Few people get this right.
So you then also know that "wchar_t is implementation defined" in all the
relevant standards (ANSI, C99, POSIX, SUS). In other words, the answer is
in the documentation for YOUR implementation.

Now, we can try to guess. But these are only guesses.

> what's in a wchar_t string? Is it UTF-32, or UTF-16 with the code units
> zero-extended to 4 bytes?

The latter is a heresy. Nobody should be foolish enough to do this. UCS-2
with the code units zero-extended to 4 bytes might be an option, but if an
implementor has support for UTF-16, why would she store extended UTF-16 (in
whatever form, i.e. split or joined, 4 or 8 bytes) in wchar_t? Any evidence
of this would be a severe bug, IMHO.

Back to your original question, and assuming "the locale is set to be
Unicode", you are as likely to encounter UTF-32 values (which would mean
the implementation does have Unicode 3.1 support) as zero-extended UCS-2
(the case of a pre-3.1 Unicode implementation). Other values would be very
strange, IMHO.

Recent standards have a feature test macro, __STDC_ISO_10646__, which if
defined will tell you the answer: defined to be greater than 1999xxL will
mean UTF-32 values. Defined but less than 1999xxL will probably mean no
surrogate support, hence zero-extended UCS-2. Undefined does not tell you
anything.
Unfortunately, the latter is also the most common setup.
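
To make the test concrete, here is a sketch of it; the 200000L cut-off
below stands in for the "1999xxL" boundary above and is my reading, not a
value taken from any standard.

    #include <stdio.h>

    int main(void)
    {
    #if defined(__STDC_ISO_10646__)
    #if __STDC_ISO_10646__ >= 200000L
        /* wchar_t holds ISO 10646 values per a year-2000-or-later
           amendment: expect UTF-32 values. */
        printf("UTF-32 wchar_t (__STDC_ISO_10646__ = %ldL)\n",
               (long)__STDC_ISO_10646__);
    #else
        /* Earlier amendment level: probably no surrogate support,
           hence zero-extended UCS-2. */
        printf("UCS-2-era wchar_t (__STDC_ISO_10646__ = %ldL)\n",
               (long)__STDC_ISO_10646__);
    #endif
    #else
        /* The most common setup: the macro tells you nothing. */
        puts("__STDC_ISO_10646__ undefined");
    #endif
        return 0;
    }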



Frank Yung-Fong Tang answered
> The more interesting question is, why do you need to know the
> answer to your question. And the ANSI C wchar_t model basically
> suggests, if you ask that question, you are moving in the wrong direction

I am not that sure. I agree that the wchar_t model is basically a dead end
nowadays. But until the new model (char16_t, char32_t) gets formalized and
implemented, it is better than nothing, since implementers did try to get
it right. Depending on the degree of conformance you require, and also on
the allowance you give to bringing in something heavy (this could rule out
ICU, for instance), the minimalistic wchar_t support might help.



Philippe Verdy wrote:
> What you'll put or find in wchar_t is application dependent.

Disagree. The result of mbtowc is NOT application dependent. It is rather
implementation dependent, which might be rather more disturbing...

> But there's only a guarantee to find a single
> code unit (not necessarily a codepoint) for characters encoded in the
> source and compiled with the appropriate source charset.

Can't parse that.

> But this charset is not necessarily Unicode.

This, you know at the moment you are compiling (which is not the same as
the result of using the library functions, by the way).

> At run-time, functions in the standard libraries that work with or
> return wide strings only expect these strings to be encoded
> according to the current locale (not necessarily Unicode).
> So if you run your program in an environment where the locale is
> ISO-8859-2,

... you are answering something completely opposite to what he asked, since
he specified

: > If the locale is set to be Unicode,

> you'll find code units whose value between 0 and 255 match their
> position in the ISO-8859-2 standard,

That is wrong. When "your locale is ISO-8859-2" (whatever that may really
mean), you know next to nothing about the encoding used for wchar_t. It
might be ISO-8859-2 (the degenerate case where wchar_t == char), it might
be Unicode (best probability on Unix if wchar_t is 4 bytes), or it might
even be something very different like a flat EUC-XX (on some East-Asian
flavour of Unix). The only thing you know for sure is that it is not
EBCDIC!


> A wchar_t can then be used with any charset whose minimum code unit size
> is lower than or equal to the size of the wchar_t type.

Wrong again. "Any" is too strong. There are many charsets that, while being
"smaller" than some other, cannot be shoe-horned into the encoding of the
wider form. For example, if wchar_t is 2 bytes and holds values according
to EUC-JP, you cannot encode Big-5 or ISCII with it, even if the minimum
code unit size is equal or even less: this is because the needed codepoints
are not all defined in EUC-JP.

Unicode, among its properties, has that of encompassing all existing
charsets, so it aims at satisfying the property you spelled out. But the
mere fact that this is an objective of Unicode should show that the other
existing charsets do not all satisfy it.


> wchar_t is then only convenient for Unicode,

I cannot see from what you are inferring this.

> However a "wide" string constant (of type wchar_t*) should be able
> to store and represe

Re: What's in a wchar_t string on unix?

2004-03-02 Thread Antoine Leca
Hi Frank,


Sorry to be in disagreement on a couple of points.


On Tuesday, March 02, 2004 5:54 PM, Frank Yung-Fong Tang wrote:

> Antoine Leca wrote on 3/2/2004, 5:50 AM:
>
>  > Rick Cameron asked:
>  >
>  > > If the locale is set to be Unicode,
>  >
>  > That part is highly suspect.
>  > Since you write that, you already know the wchar_t encoding
>  > (as well as char one) depends on the locale setting.
>
> no, not true.

What is not true?

> the wchar_t is depend on the COMPILER

Yes

> and C LIB implementation,

Yes.

> not depend on the locale setting.

Yes it does. That is, the wchar_t encoding CAN change at run time if you
call setlocale(LC_CTYPE, ...).

I know this is not common behaviour (fortunately), but it does happen with
some libcs. And regarding the standard, this IS allowed behaviour.
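
Here is a sketch of what that freedom means in practice. The locale names
are system-specific assumptions, and on most libcs both calls would in
fact yield the same wide value; the point is only that the standard
permits them to differ.

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* The wide value mbtowc() produces is allowed to depend on the
       current LC_CTYPE locale, so it may change after setlocale(). */
    int main(void)
    {
        wchar_t wc;

        if (setlocale(LC_CTYPE, "en_US.UTF-8") != NULL
            && mbtowc(&wc, "\xC3\xA9", 2) > 0)     /* UTF-8 e-acute   */
            printf("UTF-8 locale:   %#lx\n", (unsigned long)wc);

        if (setlocale(LC_CTYPE, "fr_FR.ISO8859-1") != NULL
            && mbtowc(&wc, "\xE9", 1) > 0)         /* Latin-1 e-acute */
            printf("Latin-1 locale: %#lx\n", (unsigned long)wc);

        /* If the two values printed differ, the wchar_t encoding
           itself changed with the locale. */
        return 0;
    }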


> For example, wchar_t in MS Windows is defined by Microsoft

In this particular example, yes, wchar_t encoding never changes (and stays
16-bit UCS-2).

But there are other compilers and other environments in the world.


> But again, that is defined by who wrote gcc and gnu version of lib c.

This I agree with. Particularly the latter...


> It is NOT locale dependent (unless a particular C lib implementation
> defines it so)

Here I am in agreement!


As for the rest, it is factually correct. I am in disagreement with the
ideas, but I have already exposed mine, so no need to repeat them.


Antoine




Re: Font Technology Standards

2004-03-03 Thread Antoine Leca
C J Fynn wrote:

> [ The only thing there has been any real controversy or concern about
> are three Apple patents relating to grid fitting glyph outlines of
> TrueType fonts (see: http://www.freetype.org/patents.html )

> Also AFAIK Apple have never threatened anyone with
> enforcement of these patents. ]

Apple did not publicly (AFAIK). As David writes, they are happy signing OEM
licenses!

Sampo Kaasila (the original designer) did bring up the subject; it was not
a direct threat, though, rather a casual mention that using FreeType
without a license for the patents would be illegal. I feel it was fair, by
the way, so no pun intended: TypeSolutions' product, T2K, was directly
competed with by the new-born FreeType. In fact, it was this interview, now
seemingly vanished, that made us (FreeType) understand that there was some
problem here.


Antoine




Re: What's in a wchar_t string on unix?

2004-03-03 Thread Antoine Leca
Frank Yung-Fong Tang wrote:

> Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is defined?
> or does it only mean wchar_t hold the character in ISO_10646
> (which mean it could be 2 bytes, 4 bytes or more than that?)

The latter. But if wchar_t is 16 bits, it can only encode Unicode 3.0 or
before, i.e. no UTF-16 support.


Antoine




Re: Font Technology Standards

2004-03-03 Thread Antoine Leca
[sorry for the involuntary x-post]

Frank Yung-Fong Tang wrote:
> For example, we can standardize a set of Arabic glyphs with their
> encoding.

Think about Nastaliq (rather than Naskh). There is simply no way to have it
done. Too many possibilities.

Idem for Latin (resp. Cyrillic, resp. Greek, whatever) ligatures.


Antoine




Re: What's in a wchar_t string on unix?

2004-03-03 Thread Antoine Leca
On Wednesday, March 03, 2004 7:28 PM Clark Cox wrote:

> From the C standard:
> 
> __STDC_ISO_10646__

The current text is publicly available at
<http://anubis.dkuug.dk/jtc1/sc22/wg14/www/docs/dr_273.htm>


Please use the revised wording (at the end) in place of the old one.

Thanks in advance.


Antoine




Re: What's in a wchar_t string on unix?

2004-03-04 Thread Antoine Leca
On Wednesday, March 03, 2004 11:22 PM Peter Kirk wrote:

>>> Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is
>>> defined? or does it only mean wchar_t hold the character in
>>> ISO_10646 (which mean it could be 2 bytes, 4 bytes or more than
>>> that?)
>>
> On 03/03/2004 11:27, Antoine Leca wrote:
>
>> The latter. But if wchar_t is 16 bits, it can only encode Unicode 3.0
>> or before, i.e. no UTF-16 support.
>>
> Surely if wchar_t is 16 bits, it CAN be used to encode the whole of
> Unicode with UTF-16, i.e. with supplementary plane characters
> represented as "surrogate pairs" in pairs of wchar_t.

OK, right, the programmer CAN put whatever she wants into a wchar_t (or an
unsigned short, for that matter).

I was speaking about what the compiler+libc is expected to find and to
handle correctly. Sorry for the inexact words.

> Whether these
> characters SHOULD be represented as UTF-16 code units in a wchar_t
> string (or whether representation should be either UCS-2 or UTF-32)
> is a separate issue, probably related to how the associated libraries
> handle the code units for surrogates.

And also to the level of support the compiler offers for the \U00xx
notation.

As I wrote in other posts, an otherwise compliant compiler,
 - using 16-bit wchar_t, and
 - defining __STDC_ISO_10646__ to something (which should be less
than 200111L, the date of publication of ISO/IEC 10646-2:2001,
the first one that defined the use of the planes outside the BMP)
cannot conformingly interpret the \U00xx notation in an L"" string
constant if xx is not 00, because it would then fail to conform to the
requirement that any character be representable in a single wchar_t
(more exactly, it can do it, but should emit some warning, because the
character does not fit into one wchar_t).

I usually say then that a compiler with 16-bit wchar_t can only encode
UCS-2, not UTF-16. In other words, the management of UTF-16, such as
keeping together the pairs of surrogates, pairing them when transcoding to
something else such as UTF-8, etc., should be done by the user (or by
externally provided libraries, obviously), because there is no way to tell
whether the standard library does it or not.
That said, it CAN be done, as Peter rightly said. And the rest of the job,
that is, the handling of BMP codepoints, can be left to the compiler/system
libraries, thanks to the support advertised by the #definition of
__STDC_ISO_10646__.

On the other hand, a (hypothetical, as Nelson showed) compiler/library that
defines __STDC_ISO_10646__ to be 200111L (and provides 32-bit or wider
wchar_t, of course) does assure that all the management of surrogates is
done correctly by the standard library and associated support. As such,
iswupper(L'\U00010400') (DESERET CAPITAL LETTER LONG I) should not return 0.
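
As a sketch of that division of labour (the function names are mine, and
the pairing arithmetic is the part that falls to the user on 16-bit
implementations):

    #include <wctype.h>

    /* The job left to the user when wchar_t is 16 bits: combine two
       surrogate code units into one UTF-32 code point by hand. */
    unsigned long combine_surrogates(unsigned hi, unsigned lo)
    {
        return 0x10000UL + (((unsigned long)hi - 0xD800UL) << 10)
                         + ((unsigned long)lo - 0xDC00UL);
    }

    #if defined(__STDC_ISO_10646__) && __STDC_ISO_10646__ >= 200111L
    /* Here the library itself is supposed to understand the outer
       planes, so this should return nonzero for U+10400. */
    int deseret_long_i_is_upper(void)
    {
        return iswupper(L'\U00010400') != 0;
    }
    #endif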


Antoine




Re: What's in a wchar_t string ...

2004-03-04 Thread Antoine Leca
On Thursday, March 04, 2004 2:21 PM, Arnold F Winkler wrote:

> Since "ISO/IEC 9899 - Programming Language C" was quoted, I wonder if
> you are aware of the efforts of SC22/WG14 to develop a Technical
> Report that deals with the problems discussed in this thread.
>
> The document is ISO/IEC DTR 19769 - Extensions for the programming
> language C to support new character data types
>
> The project is currently in DTR ballot and will, when approved,

According to <http://anubis.dkuug.dk/jtc1/sc22/wg14/> the ballot ended
on February 12th. Unfortunately, I saw it far too late, and did not find a
few hours to forward some comments to helpful persons (I am now resident in
an excluded country).

Furthermore, I noted that I was not able to find any discussion about this
in news:comp.std.c, which is normally quite active in discussing the
evolution of this standard.


Antoine




Re: What's in a wchar_t string on unix?

2004-03-05 Thread Antoine Leca
Hi Rick,

On Thursday, March 04, 2004 6:56 PM, Rick Cameron wrote:

> Woo-hoo! Finally, a real answer,

I am sorry for you, but when one posts to a high-volume mailing list, one
should expect a rather bad signal/noise ratio; this is often seen as an
opportunity to get some really good answers from people you usually would
not have thought of; on the other hand, there are always a number of
answers, or even parts of posts, that are useless, and some even
misleading.

Another view is that these high-volume lists are set up to give a lot of
readers the opportunity to learn things. In this sense, I believe some
people did learn things about how to manage Unicode data with C. I hope
they were not too badly misled, though.


> rather than speculation.

> -Original Message-
> From: Ienup Sung [..]

> I'm also quite sure all major Unix/Linux systems support the
> functions that I mentioned. (I also believe majority will support
> UTF-32BE, UTF-32LE and such variations too in the iconv() code
> conversions by the way.)
>
> Additionally, since POSIX defines wchar_t as an opaque data type, we
> hope that people are using the std C interfaces to do conversions
> between wchar_t and multibyte characters if possible.

I also saw speculation in Mr Ienup Sung's post... Looks like it is human
nature, I'd say.


Antoine




Version(s) of Unicode supported by various versions of Microsoft Windows

2004-03-05 Thread Antoine Leca
Hi folks,

I discovered, much to my surprise (though on reflection it makes much
sense, taking into account the dates when it was developed), that Windows
2000 only supports The Unicode Standard, version 2.0:
<http://support.microsoft.com/default.aspx?scid=kb;EN-US;227483>

The question: I was unable to find similar information referring to Windows
NT versions 5.1 and 5.2.

Certainly people here may direct me to the correct place to find it.  Thanks
in advance.


(Please, do not tell me "it supports 4.0 since you can view 4.0 characters
provided you use the correct browser and the correct fonts"; that is NOT
what I want to know. I am interested for example in sorting strings with
surrogates; seeing that in a typical WinXP distribution,
%SYSTEM32%/SORTKEY.NLS is still 256k like it was with NT3.x shows me that
this one would not support Unicode 3.1, for instance).

A similar query has been directed to Dr. International
<http://www.microsoft.com/globaldev/drintl/askdrintl.aspx>


Antoine




Re: Version(s) of Unicode supported by various versions of Microsoft Windows

2004-03-05 Thread Antoine Leca
Hi Michael,

Michael (michka) Kaplan wrote:

> For sortkey.nls -- that file does not ever change in size, as it is
> not a file that one adds characters to.

Well, I do not believe this is the most adequate place to discuss this, but
here is my view of it.

The sorting algorithm of NT, since NT 3.1 (and in fact the NLS part of OLE
2 too), uses a big table of the weights attributed to each character. This
table is even partly user-visible via the LCMapString APIs (so you do not
need to infringe any law about reverse engineering to understand all this
stuff; nor did I: this is pure black-box observation; OTOH, using the
content of this table would be a clear copyright infringement, so do not do
that).

Internally, in (Unicode-enabled) NT, contrary to the variants used with
Windows 3.1/9x and probably CE, the table is decomposed into two parts: the
locale-dependent tailoring (in SORTTBLS.NLS), and a common part in
SORTKEY.NLS. And, since NT 3.1, this SORTKEY.NLS file has been 262144 bytes
in size. One cannot miss that 262144 is 4*65536, and indeed the structure
of this file confirms without doubt that each character is mapped, in
Unicode order, to its 4 weights (and I personally did not miss it, because
back in 1994, a 256 Ki file was quite a bulky thing to deal with,
particularly since I only had 16-bit, DOS-based tools).

Now, the file in XP is still exactly 262144 bytes in size. To me, this is
evidence that only the BMP characters received weights in this file. Since
SORTTBLS.NLS is still a ridiculous 20 Ki in size, it does not hold the
missing weights either.

So I deduce from this that outer-plane characters are probably not sorted
in XP, or in other words that the Win32 API available with XP does not
fully support Unicode 3.1 (furthermore, since Whistler was developed around
the year 2000, more early than late, while Unicode 3.1 was issued
2001-05-16, it would be very surprising if it supported it).

Now, what I do not know is:

 - whether the Win32 NLS API has been fully upgraded to Unicode 3.0 for XP.
I thought so when I researched it earlier today, since the sizes of the
.NLS files did increase accordingly, but since I did not find the relevant
KB article I was not sure. Michael's approximate answer (I beg your pardon
if that was not the intent), which may lead one to think it is an
almost-full, almost-empty pot, is not very good news
 - what the status is with NT 5.2 a.k.a. Server 2003, since I do not have
access right now to this version. A quick look at the size of SORTKEY.NLS
would give some hints (see the sketch after this list): 256 Ki would say it
is still at the 3.0 level; 768 Ki (planes 0, 1 and 2, perhaps with some
adjustment to cover the delightful plane 14) would be an indication that it
supports meaningful surrogates without heavy changes to the scheme; 4352 Ki
(4.25 Mi, 17 * 256 Ki) would say the programmer extended the table without
even thinking about how to optimize it (I do not think that happened, but
who knows); and some much smaller size would mean the algorithm was
revised!

 - by the way, the same question can be asked about the beta releases of
Longhorn. However, there is not much point trying to nail down the level of
Unicode support of a beta.
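
For whoever wants to run the check, here is the black-box size probe as a
small program. It is only a sketch: the path below is the usual location
of the file, and both it and the availability of stat() with your compiler
are assumptions to verify.

    #include <stdio.h>
    #include <sys/stat.h>

    /* Classify SORTKEY.NLS by its size, following the reasoning above
       (4 weight bytes per code point, in code point order). */
    int main(void)
    {
        const char *path = "C:\\WINDOWS\\system32\\SORTKEY.NLS";
        struct stat st;

        if (stat(path, &st) != 0) {
            perror(path);
            return 1;
        }
        if (st.st_size == 4L * 65536)            /* 256 Ki: BMP only   */
            puts("Weights for the BMP only (Unicode 3.0 level at best)");
        else if (st.st_size == 3L * 4 * 65536)   /* 768 Ki: planes 0-2 */
            puts("Planes 0, 1 and 2: meaningful surrogate weights");
        else if (st.st_size == 17L * 4 * 65536)  /* 4352 Ki: 17 planes */
            puts("Table naively extended to all 17 planes");
        else
            puts("Some other size: the scheme was probably revised");
        return 0;
    }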


Antoine




Re: Version(s) of Unicode supported by various versions of Microsoft Windows

2004-03-05 Thread Antoine Leca
On Friday, March 05, 2004 6:07 PM, Frank Yung-Fong Tang wrote:

> Not sure how to find the information paper. But one way to check the
> degree of the support is to do a GetStringTypeEx against some
> characters defined in 2.0, 2.1, 3.0, 3.1, 3.2, 4.0 to see whether the
> returned results reflect what they should be.

Nice idea.

Does GetStringTypeEx() work with surrogate characters? I doubt it (very
much).
And if it does, I would be very interested to learn how it works.

Furthermore, I lack access to NT 5.2 (as I wrote elsewhere, the really
interesting case).
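
For the record, Frank's probe could look like the sketch below. The sample
characters (U+13A0 CHEROKEE LETTER A, added in Unicode 3.0, and U+10400
passed as a surrogate pair) are my choice, not his, and the interpretation
of the C1 flags is left to the reader.

    #include <windows.h>
    #include <stdio.h>

    /* Ask GetStringTypeExW for the C1 types of each UTF-16 code unit. */
    static void probe(const char *label, const WCHAR *s, int len)
    {
        WORD types[2] = { 0, 0 };
        int i;

        if (GetStringTypeExW(LOCALE_USER_DEFAULT, CT_CTYPE1,
                             s, len, types))
            for (i = 0; i < len; i++)
                printf("%s, unit %d: C1 flags %04X\n",
                       label, i, (unsigned)types[i]);
    }

    int main(void)
    {
        static const WCHAR cherokee[] = { 0x13A0 };         /* U+13A0  */
        static const WCHAR deseret[]  = { 0xD801, 0xDC00 }; /* U+10400 */

        probe("U+13A0 (new in 3.0)", cherokee, 1);
        probe("U+10400 (new in 3.1)", deseret, 2);
        return 0;
    }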


Antoine




Re: Version(s) of Unicode supported by various versions of Microsoft Windows

2004-03-05 Thread Antoine Leca
On Friday, March 05, 2004 6:39 PM, Peter Constable wrote:

> People *really shouldn't* ask "Does product X support Unicode version
> N?" They should be asking questions like "Can product X correctly
> perform function Y on such-and-such characters added in Unicode
> version N?"

Fact is, conformance to Unicode, as specified in the standard, is something
of a fuzzy target. So after reading chapter 3 (conformance) and 2.12 (which
did not enlighten me very much), I reformulate the question as:

- For each version of Windows NT, what is the version of the Unicode
Character Database the NLS API intends to conform to?

A similar question could be asked in relation to sorting (but since I
cannot easily figure out how TUS defines conformance in the area of
sorting, particularly some years ago, I cannot figure out how to spell out
the question)


Antoine




Re: Question on Unicode-prevalence (general and for Cyrillic)

2004-03-15 Thread Antoine Leca
Peter Kirk wrote:
>> 2. A graduate student mentioned that it was her impression that most
>> Cyrillic webpages (at least for Russian--her interest) are still not
>> encoded in Unicode. (She is doing some research on the use of
>> certain words in Russian and wanted to know how best to do the
>> search.)
>
> Google finds matches not just in Unicode
> encoded pages, but also in ones encoded in other Cyrillic encodings

On the other hand, if the student is willing to write some kind of spider
herself, it is very likely she will have to contemplate all the encodings,
won't she?

Antoine




Re: OT? Languages with letters that always take diacriticals

2004-03-16 Thread Antoine Leca
On Tuesday, March 16, 2004 4:12 PM
Radovan Garabik <[EMAIL PROTECTED]> wrote:

> On Tue, Mar 16, 2004 at 02:24:14PM +, Marion Gunn wrote:
>> Scríobh Radovan Garabik <[EMAIL PROTECTED]>:
>>>
>> Irish in Roman script is written i with dot above,
>> Irish in traditional script is written i without dot above.
>
> You have to decide one basic philosophical question:
> is your dotless-i the same letter as our "i", only in your
> traditional font, or is it a different letter?

I leave this to Marion.

> E.g. if you write foreign name in Irish, let's say "Philadelphia",
> is it with dots or not?

But here, I can answer: you did not read what she wrote:
when writing with "Roman script", she writes a dot;
when writing with "traditional script", she does not.

> (For example, old German in Fraktur typeface has been decided to be
> just a different font, but the same Latin letters as we know today)

Like U+017F?  ;-)


> If it is a different letter, then you should use U+0131 LATIN SMALL
> LETTER DOTLESS I where appropriate,

Well, going this way...

> and all should work smoothly

... not so sure...

> (except for spellcheckers and such,

... and keyboards, and existing applications, UIs, etc., and fonts that
have it wrong (rendering U+0069 dotless), and it needs very strange "Roman
script" fonts, where U+0131 should be rendered with a dot!

Here for sure you will surprise a lot of Turks, and even many more
people!!!



Antoine




Re: Irish dotless I

2004-03-16 Thread Antoine Leca
On Tuesday, March 16, 2004 5:48 PM
Peter Kirk <[EMAIL PROTECTED]> wrote:

> On 16/03/2004 07:35, Carl W. Brown wrote:
>
>> I suspect that just changing the font to eliminate the dot will be
>> easier. Software won't have to be changed, existing code pages will
>> not have to be changed, searches will work, etc.
>>
> It has the disadvantage of making these fonts useless for Turkish and
> Azeri,

How useful to Turks are the fonts used in France, where the style is lower
case or upper case only, and all the accents are removed (and please do not
tell me that accents should be drawn on capitals: I know that, and I am not
speaking about writing a book for the Imprimerie nationale, but rather of
fonts used for advertising, which was Marion's purpose).


> And of course the fonts would not be acceptable
> to most users of English and other Latin script languages.

Again (apart from the claim being dubious: I certainly can decipher Latin
written in Uncial), what is the point?
Are French Canadians *required* to understand French specificities (and
vice versa)?

On a similar vein, are everybody required to understand the "hacker" script?
(try <http://www.google.es/search?q=hacker&hl=xx-hacker> to get my
point)


> On the other hand, the change to Unicode required for Irish to use
> dotless i would be rather trivial, simply adding Irish to the existing
> list currently consisting of Turkish and Azeri, to which Tatar,
> Bashkir, Gagauz, Karakalpak and various minority languages of
> Azerbaijan should also be added.

Remember that in Marion's original post, Irish written in "Roman script"
should be written with a dot (for example, because there are Turks living
there, and they surely want their names written correctly when using
"their" script; I know the name ought to be tagged as a different language,
but everybody knows this is impractical).

So what you are proposing is to add another line to CasingRule, with the
added discriminant of the script.

Not as easy as one might expect.

Furthermore, it has the very useful (from the point of view of some; very
disastrous for others; and very expensive for many people, like me as a
European taxpayer ;-) see below for the reasoning) property that, since
Irish is a pretty common language with quite a bit of already encoded
material, all this material would suddenly become obsolete and would need a
careful examination to decide whether any U+0069 should be recoded or not,
based on whether it is Irish or not (I am thinking here about EU material,
where every non-Irish proper name inserted inside Irish text should NOT be
converted).

Also, Michael, tell us: when your name is written inside some Irish text,
should it be considered English, or Irish? Then, should the i be dotted?
And if it is English, and not dotted, then we would have to add some more
lines to CasingRules, wouldn't we?

;-)


Cheers to all,

Antoine




Re: Irish dotless I (was: Languages with letters that always take diacriticals)

2004-03-22 Thread Antoine Leca
John Cowan wrote:

> Pavel Adamek scripsit:
> 
>>> From the viewpoint of sorting,
>> the coding  
>> would be much better than
>> .
> 
> For Czech, yes.  For Spanish we want the latter.

What for?


Antoine




Re: Novice question

2004-03-23 Thread Antoine Leca
Hi John,

John Snow wrote:
>
> I am speaking to a client regarding their website being translated
> into a number of languages including Bengali, Urdu and Punjabi, which I
> am told are not very well supported by Unicode.

This is not true. These languages have been supported by Unicode since the
first public version, more than 10 years ago.

However, they use so-called complex scripts, which are not easy to "write"
with a computer. The fact that they are mainly used outside of the major
industrial countries did not help in this respect, as you might guess. As a
result, it is still quite difficult to have a readable web site written in
these languages *and* viewable without impairment.

Urdu is written with the so-called Arabic script, but in a way (`style')
that is visually quite distinct from the Arabic you might have already seen
(Urdu is written in Nastaliq style as opposed to the Naskh style usually
used for the Arabic language). This usually requires distinct fonts, which
are harder to find and much less easily available than Naskh fonts. Of
course, your potential clients will probably have the fonts, but the key
point here is the potential diversity you may encounter. As I understand
things (I did not test all the browsers that Edward indicated), the latest
versions of the browsers should be able to work, provided the correct fonts
are there (for example, Arial Unicode MS from Office XP is not sufficient
here: it only has the Naskh style). The solutions outside Unicode (using
the so-called "font hacks") are *not* going to give you better results, I
would guess. And furthermore the future of such solutions is bleak, since
everyone is moving toward Unicode.

Regarding Punjabi, this is probably the easiest. Depending on the country,
Punjabi is written either with the Arabic script (this is called Shahmukhi;
it can use the Naskh style) or in a dedicated script (called Gurmukhi),
which is not as easy to write as Latin, but is not overwhelmingly complex
either. The key point here is that this is not a #1 priority given the
"market". As a result, I am positive that recent versions of IE are able to
display it (with Arial Unicode MS as the font). With other browsers, well,
things are progressing, but there is still work to be done. Particularly on
"alternative" OSes such as Linux and the like, since the OS right now is
without native support for this script. On the other hand, MacOS has had
Gurmukhi support for many years, so I expect fewer problems on this
platform (but I cannot be positive, my own Mac is far too old). In Gecko
this is something which is under development; I did not check recently
whether it works. In Opera, it does work provided the OS supports it (in
fact, in Opera in general there is nothing specific about any of these
scripts, provided you are using Unicode and not some "font hack", and
provided of course there is OS support).
So in general, the prognosis for Punjabi is quite good (but keep in mind
you may need two versions, one in Shahmukhi for "Pakistan", one in Gurmukhi
for "India").

I left Bengali for the end, despite its being among the 5 most spoken
languages in the world. Bengali is much more complex to write. Font
availability is scarce (for example, I do not have access to any of release
quality). Microsoft is definitely working in this area, so things will get
better on the Windows platform within the next years (alternatively, you
can read that as saying there is still work to be done). On Linux, there is
a group of enthusiasts who are working hard to get results; due to the way
things are set up, their results are likely to be reusable on MacOS too.
Depending on the size of your project, going for a "font hack" or other
similar solutions such as iPlugin from CDAC might be a better alternative
in the short term. In the long term, of course, as above, Unicode is the
correct solution.


Again, please note that Unicode is not the culprit here: Unicode is a
standard to encode texts. As such, it just works for these languages. And
it is definitely the standard that all browsers follow to interpret the
"texts" they are sent. However, making this correctly visible on screen is
another matter. And this is where you may encounter difficulties
(particularly with Bengali).


Hope this helps,

Antoine




Re: Novice question

2004-03-23 Thread Antoine Leca
Philippe Verdy <[EMAIL PROTECTED]> wrote:

> From: "Edward H. Trager" <[EMAIL PROTECTED]>
>> Also, I would not bother testing Windows OSes prior to Windows
>> 2000/XP.
>
> Why not?

Since it does not even work on these, there is no point testing it on
development-dead platforms either.


Antoine




Re: [OT] C-sharp

2004-03-23 Thread Antoine Leca
Philippe Verdy <[EMAIL PROTECTED]> wrote:
>> The "musical sharp sign," of course, is U+266F, making the correct
>> spelling C♯.

From TUS: "These symbols are typically used for text decorations, but they
may also be treated as normal text characters in applications such as
typesetting chess books, card game manuals, and horoscopes."

;-) [replace with ♯ if you want]


> But the "orthograph" is unambiguously "C#" with ASCII characters at
> least for its standard source file extension

Huh?
Never seen a source file with the .C# extension (nor .C♯). Seen a lot of
them with .cs. The compiler itself is named csc.
And C♯ does not seem to be an allowed identifier (neither is col·lectiu, so
for me this language is bad ;-) And please Ken, you are specifically
prohibited from commenting on this one: you know we cannot agree on it
:-D).

> (or spelled in French "C
> croisillon", or more commonly "C dièse" even though it is the French
> spelling for the sharp musical symbol).

In French, one should say « do dièse ». Which takes away much of the spice.


Antoine




Urdu Unicode website [Was: Novice question]

2004-03-24 Thread Antoine Leca
Peter Constable wrote:
>
> Urdu can be written using naskh-style Arabic (supported on WinXP,
> Win2K...),

Peter,

I do not see the connection between the OS support in Windows for a given
language and the translation of a website, but while we are at it: how do
you enter Urdu with Microsoft Windows 2000? I have a Spanish one with SP4,
IE6 SP1, Arabic script enabled. Surely something is missing, but where can
I find it? Should I use KLC?


Of course this is nitpicking, because it is crystal clear to anybody here
that I have a wide range of ways to enter Urdu into my box without having
to resort to some MS-provided DLL (beginning with borrowing the XP DLL).
Also, it is evident that a Spanish Win2000 box is not a representative box
for a typical Urdu reader. Sorry about this.


Antoine




Re: Urdu Unicode website [Was: Novice question]

2004-03-24 Thread Antoine Leca
On Wednesday, March 24, 2004 5:03 PM
Peter Constable wrote:

>> how
>> do you enter Urdu with Microsoft Windows 2000? I have a Spanish one
>> with SP4, IE6 SP1, Arabic script enabled. Surely something is
>> missing, but where can I find it? Should I use KLC?
>
> My understanding is that Spanish Windows 2000 includes input methods
> for Arabic script.

This is my understanding, and also my experience, but I fail to see the
point in relation to Urdu.


Antoine




Re: Urdu Unicode website [Was: Novice question]

2004-03-25 Thread Antoine Leca
Philippe Verdy <[EMAIL PROTECTED]> wrote:
>
> In my Windows XP, I have four keyboard layouts proposed for the Urdu
> language: "Arabic (101)", "Arabic (102)", "Arabic (102) AZERTY" and
> "Urdu", plus the keyboards for the Brahmic/ISCII transliterations in
> India,

What kind of keyboards are those?
XP generates Unicode, doesn't it? How can you make it generate ISCII
(besides inside an application, of course)?

Or do you mean INSCRIPT instead?

> and the Tavulesoft "Urdu" keyboard layout, all of which can be
> added simultaneously to the language bar are selected from there.
> Isn't that enough?

[ Still do not see the relationship with web sites... ]


> I don't see where is the issue there. May be it's only for Windows
> 2000

If you are referring to the discussion about Urdu, the issue is that Peter
Constable pointed out that Urdu was "supported" (using Naskh style, that
is) in both Microsoft Windows 2000 and Microsoft Windows XP. Apparently the
DLL for the Urdu keyboard slipped out of the Windows 2000 shipment or
something, and the patch once available is no longer online. This was
pointed out to Peter, and he is now trying to improve the situation within
Microsoft. So Urdu typists with access to a Windows 2000 box might have in
the near future a supplementary option to type their language. Windows XP
users are not affected in any way by this discussion.

As Peter correctly noted from day 1, all this stuff is not very important,
since Urdu users really expect the Nastaleeq style, so either they are not
using the Urdu support, or they use proprietary solutions whose extent
remains to be explained by competent persons.

> (and of course Windows 9x/ME which does not support easily
> multiple layouts,

This is news to me.
What it does not support easily are other scripts like Gurmukhi or Bengali,
particularly on input ;-). Nor the supplementary Arabic characters needed
for Urdu, for instance. For this very reason, one of the first answers to
the original question, made by Edward, correctly pointed out that testing
on 9x or 16-bit boxes would probably be useless.

> and where Tavulesoft Keyman is probably a good solution).

Tavultesoft (<http://www.tavultesoft.com/keyman/>). I do not know the
extent of it. I am not competent about this. Their home page does not seem
to target the Urdu market specifically, and historically they did not.
So I have no clue about the real extent of this solution for typing Urdu
into IE/Gecko/Opera on 9x. I am not even sure it is really helpful (one
would have to check WM_UNICHAR support, as you probably know; Peter should
be able to tell us if it works with IE; about Gecko, a quick search on
mozilla.org returned no matches...).
And of course if you have to type it first into Wordpad or Word and then
cut-and-paste, well surely Unipad is a better solution then... and
definitely that is not an operational workflow.


Antoine




Urdu/Penjabi/Bengali website [Was: Novice question]

2004-03-25 Thread Antoine Leca
Hi Peter,

On Thursday, March 25, 2004 2:19 PM
Peter Kirk <[EMAIL PROTECTED]> wrote:

> On 25/03/2004 03:33, Antoine Leca wrote:
>
>> As Peter correctly noted from day 1, all this stuff is not very
>> important, since Urdu users really expect nastaleeq style, so either
>> they are not using Urdu support, or they use proprietary solutions
>> which extent remains to be explained by competent persons.
>>
> There are Unicode Nastaliq fonts available,

Good to know. As I said, the point is to know to what extent they are used.

>>> (and of course Windows 9x/ME which does not support easily
>>> multiple layouts,
>> What it does not support easily are other scripts like Gurmukhi or
>> Bengali, particularly on input ;-).

> I disagree. There is no reason why these scripts cannot be displayed

Sorry Peter. I was referring to input. I know very well that displaying has
not been a problem for a long time (since IE5, in fact).

On another forum (dealing with fonts) I see every other week a request from
some Indian asking Microsoft to enable Windows 9x to input Indic scripts in
some way. This has been happening for many years. Clearly, Microsoft will
not release that. Since in fact this is not very complex given what is
currently available, I believe this is a deliberate move on the part of
Microsoft.


So as a result, and trying to get back to the original question, we have:

 - for any serious use of the language, this will require inputting data,
and thus Windows 2000 at the very least (and more likely XP according to
Peter Constable), XP SP2 for Bengali

 - if we are referring only to casual browsing, then downloading specific
fonts is not an option in my eyes, so again we are stuck, this time by the
fonts, with 2000/XP/XP SP2 according to the language (except perhaps
Punjabi written in the Arabic script, which might be easier). Here
obviously font hacks and similar (iPlugin, EOT, etc.) are an alternative to
Unicode worth considering

 - for the serious reader but without any input (as would be the case for
an extranet where the only input would be numbers), then yes, this might be
accommodated using less powerful platforms, and typically Windows 9x, with
these additional points:
a) required downloading of some additional fonts beyond the defaults
b) for Urdu and Punjabi, perhaps upgrading IE to some version (to
obtain the correct version of USP10.DLL); at least IE5 (released 1999)
c) for Bengali only, also required downloading [and installation...] of
an updated version of USP10.DLL, still to be published; furthermore, I
have no clue about the legality of that action. Or alternatively a future
version of IE might furnish it, but I am not at all sure that those
versions will be able to install on Win98 or Me (I am sure they won't
install on 95). Here I will not hold my breath.
Once the update is done, Opera should work, and I expect Gecko should as
well. Not tested, though.


Antoine




Re: Printing and Displaying Dependent Vowels

2004-03-26 Thread Antoine Leca
Avarangal asked about
> one of the requirements by educational establishments is the ability
> to print and display dependent vowels without dotted circles.

John Cowan answered:
> Avarangal scripsit:
> 
>> Can anyone provide information on the sequences used for displaying
>> and printing dependent vowels as standalones.
> 
> The standards-conforming way to do so is to precede the dependent
> vowel with a space character (U+0020).

Does it fulfil the need (i.e., displaying _without_ dotted circles)?
If so, where is it written?


Antoine




Re: Printing and Displaying Dependent Vowels

2004-03-26 Thread Antoine Leca
Sorry to answer my own post.

> Avarangal asked about
>> one of the requirements by educational establishments is the ability
>> to print and display dependent vowels without dotted circles.
>
> John Cowan answered:
>> Avarangal scripsit:
>>
>>> Can anyone provide information on the sequences used for displaying
>>> and printing dependent vowels as standalones.
>>
>> The standards-conforming way to do so is to precede the dependent
>> vowel with a space character (U+0020).
>
> Does it fulfil the need (i.e., displaying _without_ dotted circles)?
> If so, where is it written?

It seems many are thinking about the section in 2.10, titled "Spacing
Clones of European Diacritical Marks". I read it as applying to diacritical
marks (and perhaps only European ones, but the distinction looks blurry to
me). The beginning of 2.10 makes it quite clear that diacritics are only
one class (the most important, though) of combining characters. Indic
dependent vowels are another.

Also, something which is probably very relevant to Avarangal: the fact is
that the implementation from a major vendor in the field, Microsoft
Uniscribe, does retain the dotted circle (if present in the font; if not,
you would probably get the .notdef glyph instead).


Antoine




Re: Printing and Displaying Dependent Vowels

2004-03-26 Thread Antoine Leca
Avarangal wrote:
> display dependent vowels without dotted circles.
>
> Can any one provide information on the sequences used for
> diplaying and printing dependent vowels as standalones.

Microsoft's Uniscribe allows you to display a dependent vowel with the
following sequence (to be followed precisely): U+0020 U+200D U+0Bxx.
U+00A0 (instead of U+0020) does not work. Neither does U+200D alone.
Also, these should be the first characters in the string passed to the
Windows API: if there are any characters before them, they will not trigger
the special behaviour, and you will end up with the circle.

Please note that trying to display something a bit more complex, like U+0020
U+200D U+0BC6 U+0BD7 or U+0020 U+200D U+0BBF U+0B82, will fail.
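
Against the plain Win32 GDI, the recipe would look something like this
sketch (U+0BBF TAMIL VOWEL SIGN I as the example; error handling omitted):

    #include <windows.h>

    /* Draw a dependent vowel standalone: SP + ZWJ + vowel, with nothing
       before the sequence in the string passed to the API. */
    void draw_standalone_vowel(HDC hdc, int x, int y)
    {
        static const WCHAR text[] = { 0x0020, 0x200D, 0x0BBF };

        ExtTextOutW(hdc, x, y, 0, NULL, text, 3, NULL);
    }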

[ I am sorry for the misleading words I used in earlier answers to others.
It took me some time to figure out exactly what this tool does. ]

Hope this helps,

Antoine




Re: Printing and Displaying Dependent Vowels

2004-03-26 Thread Antoine Leca
On Friday, March 26, 2004 7:12 PM, Philippe Verdy wrote:

> Indic scripts are a bit unique in that they have a syllabic
> structure decomposed into separate letters with a base consonant and
> a "combining" (this is not the proper term for Unicode) vowel
> modifier after it. This differs from European alphabets (Latin,
> Greek, Cyrillic) or even from some Asian or African syllabaries
> (notably Hiragana/Katakana) where these grapheme clusters are (almost
> always) coded as combining sequences with a base character and
> diacritics.

Where exactly is the difference with, say, IPA?
And with vocalized Perso-Arabic?

(And it is not all Indic scripts: Thai and Lao behave differently)


> Indic scripts offer several variations here because there are also
> half-forms for these vowels,

Please define "half-form for a vowel". This is new to me.


> A sample with Devanagari could be: <अा> (U+0905 LETTER A, U+093E
> VOWEL SIGN AA) which should normally be presented like the
> precomposed <आ> (U+0906 LETTER AA), but which incorrectly displays
> the dotted circle with the "Mangal" font.

Mangal has nothing to do with this. What you are seeing and criticizing is
Uniscribe's implementation, the fruit of a compromise between performance
and dealing with special/unusual cases. This case is not clearly specified
by the Devanagari OpenType specifications, but it appears that the default
behaviour (considering U+093E as a dependent vowel shown in isolation, and
rendering it with the added circle) has been "elected" here by the
implementation. In my own implementation of the same specifications, I
consider this a perfectly correct and useful sequence (used in India to
teach the syllabary), so I do not insert the circle, and as a result (with
Mangal) it is shown as you expect.

> So an author has to make some notational compromises here. But still,
> I do think that using NBSP as this empty/null base consonant before
> the dependent vowel will create the intended Unicode default grapheme
> cluster.

About NBSP: I hope Paul will read my other post (directed to Avarangal) and
will enhance Uniscribe in this respect, allowing NBSP to behave the same as
SP here. I am not sure (one should look at Unicode 2.0), but I seem to
recall the behaviour with NBSP was added around 3.0, and since Uniscribe
was designed against 2.0...


> Then it's up to the font or renderer to show the NBSP+vowel
> cluster properly, without the dotted circle, but it's not a problem
> of Unicode itself.


I have been reading the Unicode list for quite some time (and sorry
Philippe, but I am speaking of time before you came in). I do not know why,
but every now and then, there are comments from regulars that say "This is
not a defect of Unicode itself", even when nobody is even thinking such a
thing. From a psychological point of view, this is quite interesting. ;-)


> If dotted circles appear before the symbol, or if the symbol is shown
> with a square box for a missing glyph, it's not the fault of Unicode.

Again! ;-)


>> Also, something which is probably very relevant to Avarangal, fact
>> is the implementation from a major vendor in the field, Microsoft
>> Uniscribe, does retain the dotted circle (if present in the font; if
>> not, you would probably get the .missing glyph instead).
>
> I'm not sure that UniScribe is the cause of this problem.

I am pretty sure it is! Because if he were using FreeType, he would not
have any problem displaying the standalone glyph. :-D

Something more complex would be to have some way to display *various*
representations of the dependent vowels; Tamil U+0BC1 and U+0BC2, which
come to mind, show so much variation that one glyph in the font is not
likely to suffice. But for the well-known Burmese AA U+102C, or U+0D41 and
U+0D42 in traditional Malayalam, this might be an open question.
Here again, using FreeType this is perhaps doable, but with some
"higher-level" engine it would be much more complex. If the need for it
arises, probably the option would be to define a user-accessible OpenType
feature (of the alternates kind).

> There just
> appears to exist no GSUB rule in some fonts like Mangal to handle the
> case of NBSP followed by an Indic vowel sign or combining character,

Well, we are quite far from the original subject, but anyway...
You are missing something important about the Indic OpenType
specifications. Besides (in fact before) the substitutions and then the
positioning, which are encoded as the TTO tables GSUB and GPOS, there are
two stages called "analysing" and "reordering". Analysing deals mainly with
splitting the stream into clusters. Reordering then does a number of
operations, and it is this step that will insert the dotted circle. Or will
not, depending on how it is programmed.

> I'm not an expert of UniScribe programming, but there may exist some
> Indic features in Indic fonts, which can be enabled in UniScribe to
> change the rendering behavior by includin

Re: Printing and Displaying Dependent Vowels

2004-03-26 Thread Antoine Leca
Philippe Verdy wrote:
>
> Space is a base character, then it combines with the next diacritic
> with which it creates a "default grapheme cluster" which should be
> interpreted as if it was a single character identity.

Agreed so far for diacritics. Agreed also for non-spacing dependent vowels
like U+0BC0. Agreed for the special exceptions like U+0BBE. I disagree for
U+093F or U+0BBF (Mc not included in Other_Grapheme_Extend, so there is an
allowed break before them), unless there is something I missed here.
> It is NOT defective.

I do not understand. I did not say anything implying that, did I? I just
remarked that I was not able to find in the text of the standard the words
that would give vendors and implementers (like me) a solid basis for
modifying their engines to provide special exceptions to deal with the
combination U+0020/U+00A0 followed by U+093F.

And no, this is not the same as displaying a diacritic, because it should be
re-ordered, rather than being a "spacing representation of diacritics".


> Now how would you interpret differently SPACE+diacritic or
> SPACE+vowel sign?

See above.

> If you display a dotted circle there, then you'll
> display two separate glyphs for a single grapheme cluster, and this
> is not intended by the normal Unicode character model.

?

How do you believe anybody will show, say, U+0063 U+0300? Which font has
this as a single glyph?

Furthermore, a single character like U+0916 (Devanagari KHA) is very often
rendered with two glyphs (namely, Half-Kha then the glyph also used for the
AA-matra, U+093E). Unicode does not enter into how this stuff is handled.


Antoine




Re: Printing and Displaying Dependent Vowels

2004-03-29 Thread Antoine Leca
On Sunday, March 28, 2004 12:03 AM, James Kass wrote:
> So, if the question is how to make an OpenType font *not* display the
> dotted circle on Windows with Uniscribe, one idea would be to add a
> spacing glyph to U+25CC (DOTTED CIRCLE) in the font.

If you do so, you will end up defeating the normal behaviour, which is to
draw a circle when someone makes an error while typing. Depending on the
intent of the font, it may or may not be a good idea.

Since Avarangal seems to be now under "non disclosure agreement" with
Microsoft, we do not know for sure what is his intent.
We also do not know if there are variations between releases (I hear there
are, but do not feel it is my job to investigate), or generally what the
real specifications in this area are (the official one being that the
sequence SP+ZWJ+some_mark renders without displaying the circle, but we know
it is not always enforced).

In the general case of a font intended for general use, where rendering
without the circle is intended only in special cases like drawing a keyboard
layout for reference, I still believe it is better to have the circle and
resort to special manipulations, like SP+ZWJ+vowel or drawing directly with
ExtTextOut(ETO_GLYPH_INDEX), in order to draw the keyboard layout. At least
because complicating a font to cure a defect in one version of one (the)
rendering engine does not seem to me an engineering solution. (I have since
read your other post, which rather seems to agree with me.)


> Another approach is to simply use a non-OpenType Unicode TrueType
> font for Tamil.  The dotted circles don't seem to ever appear unless the
> font-in-use has OpenType tables covering the script-in-use.

Right. (The only remaining problem will then be the overhang and centering).


Antoine




Re: Printing and Displaying Dependent Vowels

2004-03-29 Thread Antoine Leca
On Monday, March 29, 2004 2:14 PM, John Cowan va escriure:
>
> The bottom line is that SP+vowel and NBSP+vowel are prescribed by the
> Unicode Standard,

I am sorry John, I must have missed a post of yours. I asked you where it is
written, and did not find any answer to this; unless one considers that all
marks, including spacing combining vowels, are "(European) diacritics".

I did find some things in UAX29 about grapheme clusters (as indicated by
Philippe), but also found that Mc characters do not seem to be concerned
(Mn, on the other hand, seem to be). I now understand that any base
followed by "Grapheme_Extend" characters is to be seen as a cluster. I found
"Grapheme_Extend" defined as Other_Grapheme_Extend + Me + Mn in the UCD.
(But I was not able to encounter this in the standard itself. Never mind, I
must have missed something obvious.)
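
For what it is worth, a minimal sketch of that reading (a rough
approximation only: it takes Me and Mn as the whole of Grapheme_Extend and
ignores Other_Grapheme_Extend entirely):

    import unicodedata

    def clusters(text):
        # Split text into "base + Grapheme_Extend*" clusters, with
        # Grapheme_Extend approximated by categories Me and Mn; note
        # that an Mc mark such as U+093F does NOT extend a cluster
        # under this reading.
        result = []
        for ch in text:
            if result and unicodedata.category(ch) in ("Me", "Mn"):
                result[-1] += ch
            else:
                result.append(ch)
        return result

    print(clusters("c\u0300"))         # one cluster: Mn extends
    print(clusters("\u0915\u093F"))    # two clusters: Mc does not, here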

I am sorry to insist on these issues. I have real difficulty understanding
where the specifications are, when chapter 2.10 of the Unicode book says one
thing while dealing directly with the issue, another document that is
supposed to be just as normative says otherwise, or rather is to be
interpreted otherwise, and still none of them matches exactly what people in
this forum are expecting.

(And furthermore, when asked about issues of conformance, the former answer
was "it does not matter", or "it should not matter", or "depending on what
you are doing", etc., in a word, ways to avoid answering the original
question.)


> if they don't work [...] the system is broken.

As James eloquently showed earlier today, I am not that sure we want things
this way.

The text in The Unicode Standard explicitly refers to the case of the
European diacritics. There (well, here!), because of typing habits (the use
of so-called dead keys), users expect that the combination of a diacritic
and a space is rendered as a spacing clone of the diacritic. I read the 2.10
snippet as guarding this convention.
(Of course, this is my interpretation, I can very easily be wrong.)

On the other hand, typing habits in other parts of the world are not that
entrenched. After all, dead keys have been with us for more than a century,
while keyboards for combining characters that may reorder before the
preceding characters are only twenty years old. Furthermore, the custom is
to provide disambiguating devices, such as a bell (Thai) or a dotted circle,
when a vowel is mistyped. Evidently, Microsoft followed this when they
designed Uniscribe/Indic OpenType. What you are saying is that when a
mistyped vowel follows a space character, it should appear hanging from
nothing, while the situation will be different if it is typed after a
virama, or another vowel, or some other mark.

As I said, I am not sure this is what we really want.


Antoine




Re: Printing and Displaying Dependent Vowels

2004-03-30 Thread Antoine Leca
On Monday, March 29, 2004 8:11 PM
John Cowan va escriure:

> Well, it depends on what the equivoque "combining marks" in the title
> of Section 7.7 means.

Ah! This is the place I did not look into! (It was not obvious to me that
text about the dependent vowel marks had to be searched for in the European
alphabetic scripts section! But as Ken pointed out elsewhere, I should have
known better: obviously, one must know the whole standard text, and its
history, before making any assumption about the signification of any given
section: after all, this is not an ISO standard.)

Many thanks John for pointing this out.


> This is where (p. 187) the remarks about SP and NBSP appear:
>
> # Marks as Spacing Characters.  By convention, combining marks may be

OK, this one says it should apply to all combining marks, and does not make
any distinction between spacing and nonspacing. So the issue now appears
clear (and we implementers of rendering tools now have work to do, haven't
we?)

Now I will file erratum reports for all the discordant things I have found.


Antoine




Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

2004-03-31 Thread Antoine Leca
On Tuesday, March 30, 2004 11:42 PM, Ernest Cline va escriure:

> The main usage is with compound words such as "ice cream" or
> "Louis XIV" or commercial phrases such as "Camry SE" where for
> esthetic reasons an author would prefer that the space not expand
> upon justification,

Well, as one who takes the pain to enter ALT+0160 here and there
(particularly around « and » in French), I should say that I certainly would
like the space between Louis and XIV, or between Camry and SE, to stay of
fixed width; on the other hand, I would expect the one between ice and cream
to expand according to the rhythm of the paragraph, so as not to break the
reading. Like in

Mum,   I   want   an   ice   cream

against

Mum,    I    want    an    ice cream

> I am not aware of any style guides that offer either
> normative or informative guidance for either choice.

The French style guides (after all, we can use Unicode to write French as
well as English, can't we?) generally say that NBSP should not be expanded
on justification. I do not know right now (I lack access to definitive
references) whether this is general to all non-breaking spaces, including
those that are fixed-width per se, or whether it specifically applies to
U+00A0. It should be pointed out that non-breaking spaces occur rather
frequently in French (around several punctuation characters), and because
many word processors are not rich enough to encode it as it should be
(i.e., as ZWNBSP+THSP+ZWNBSP, \uFEFF\u2009\uFEFF), well, they encode it as
U+00A0 :-(.


> NBSP ZWNJ breaks, but should it justify?
^^
This is an error, isn't it?


Antoine




Re: French typographic thin space (was: Fixed Width Spaces)

2004-04-01 Thread Antoine Leca
On Thursday, April 01, 2004 12:37 AM
Asmus Freytag <[EMAIL PROTECTED]> va escriure:

> Have you folks noticed the addition of Narrow Non Break Space?

No, I did not. In fact, when I saw your message, I believed it should be a
character whose code would be 0401 or something like that. ;-) I know it is
not (U+202F or  ).

Is it intended (in part) for French typography?

And if the answer is yes, when will it be supported in many fonts and
rendering tools? It certainly will lead to much more aesthetically pleasing
documents than the present  ...


Antoine




Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

2004-04-02 Thread Antoine Leca
Arcane Jill wrote:
> There were sixteen block-graphics characters, remember?
> They each were subdivided into four quadrants, each of
> which could be either black or white, according to the
> low order four bits of the codepoint. The all-white
> block-graphics character was visually indistinguishable
> from space, but was NOT space.

This reminds me of a similar character in codepage 437. It was encoded as
255. Similar uses. I believe it has been unified with U+00A0.


Antoine




Re: U+0140

2004-04-16 Thread Antoine Leca
On Thursday, April 15, 2004 8:16 PM, Philippe Verdy va escriure:
> I thought it was already answered in this list by a Catalan speaking
> contributor: the sequence L+middle-dot in Catalan is NOT a combining
> sequence.

No? Then what is it? It looks very much like one, to me.

> The middle dot in Catalan plays a role similar to an hyphen
> between syllables, to mark a distinction with words where, for
> example a double-L would create an alternate reading.

Yes (although I am not sure we can write "similar to hyphens", since I do
not know the history of the hyphen).

> The dot indicates that each L must be read distinctly (or read
> with a long or emphatic L).

Ought to. I.e., it would be a precious pronunciation, at least for the
Barcelonian way of speaking. In other places, the prolonged pronunciation
may be the default for literate speech, too (this is the case here in
Valencia). Colloquial speech definitely makes no difference between l·l
and l.

The very reason for the dot is to disambiguate between two identical
orthographies inherited from the past, without actually changing the
orthographies (i.e., dropping one l, or adopting the standard but bulky "tl"
digraph).
So, "ll" now unambiguously designates the palatal l (the IPA code of which I
am presently unable to find in Unicode; it is a turned y), coming from
colloquial words, while "l·l" unambiguously designates the may-be-prolonged
[l] coming directly from Latin. Before the reform (~100 years ago), both
were written identically, which led to problems.


> In French for example we have words like "maille" to be read as
> /maj/, and the same "-ill-" written diphtongs after another vowel
> occur in Catalan.

It is written -i- (not ï nor í), occurring after some vowel. Like "mai"
(never), which is sounded the same as "maille" in Parisian French.

> But French will not write "-ill-" if it occurs
> between two vowels where the two L must have the sound L (if this
> occurs in french, only 1 L is written, and the emphatic/long sound is
> not marked).

Of course not "-ill-" (why on earth someone will introduce an -i- where
there is no reason for it?), but rather "-ll-", like in "collège" or
"parallèle". TWO L's ;-). This is after the two most used words in Catalan
that have the ·, namely "col·legi" and "paral·lel".

And yes, similarly to Catalan, the emphatic/prolongated l sound is not
usualy marked.


> Catalan has this orthograph, and writes the
> emphatic/long L distinctly. So it needs a symbol for that. The
> middle-dot is then considered in Catalan as a letter,

This is not a letter. Just as hardly anyone will consider the apostrophe to
be a letter in Romance languages (or in English, either).
Note that I am _not_ saying · is like an apostrophe in Catalan (the latter
is a punctuation symbol, which separates words). But it is not a letter.
Neither are ´ or ¸.

> that will occur in the middle of words.

Specifically between L's (either lower or upper case, but not a mixture).
There are other rules, too, such as, IIRC, that the letters surrounding the
l·l should be vowels (not 100% sure here, and I did not care to check).


> I don't know if the middle-dot can be used in Catalan as a cadidate
> position for a line break with hyphenation:

It is.

> if yes, is it kept before
> the hyphen, or is the middle-dot used alone, or is the middle-dot
> replaced by a regular hyphen?

The latter.

> I don't know. But if the middle-dot
> must be replaced by a hyphen, then it is a punctuation (similar to
> hyphens used in compound-words).

What is the first k in a hyphenated "dicke" in German? (it becomes
"dik-ke"). At any rate, I would not tag it as "punctuation"!
Here we have a similar case: when l·l is hyphenated, the former "diglyph",
i.e. "l·", is transformed to "l". The obvious reason is that there is no
more need to disambiguate, since a palatalized "ll" will never be hyphenated
in Catalan (nor in Castilian, nor will "lh" in Portuguese or Occitan, nor
will "gli" in Italian).


> But in Catalan, the middle dot should not be kerned into the
> preceding uppercase L, like it would appear if it was considered
> equivalent to .

Sorry, but who are you to dictate laws about kerning in Catalan?
Kerning is essentially an optional feature related to fonts, and I do not
see any reason to avoid "kerning" an L and a · (which would be in a title,
moreover), if the result is aesthetically unpleasant, perhaps because the
font designer did not consider the case.


> If there's something really missing for Catalan, it's a middle-dot
> letter with general category "Lo", and combining class 0 (i.e. NOT
> combining). It's unfortunate that almost all legacy Catalan text
> transcoded to Unicode are based on the middle-dot symbol (the one
> mapped in ISO-8859-1 and ISO-8859-15) which is not seen by Unicode as
> a letter (Lo) but as a symbol only.

Considering that the · is present on any Spanish keyboard these days (shift
3), and that on the other hand almost no keyboard except ancient typewriters
do h

Re: U+0140

2004-04-16 Thread Antoine Leca
On Friday, April 16, 2004 12:31 AM, Peter Kirk va escriure:

>> Peter Kirk a écrit :
>>
>>> What is U+2027 intended for? The name suggests that it might be what
>>> is needed for Catalan.
>>
>> Hyphenation point is primarily used to visibly indicate
>> syllabification of words. Syllable breaks are potential line breaking
>> opportunities in the middle of words. The hyphenation point is
>> mainly used in dictionaries and similar works. When an actual line
>> break falls inside a word containing hyphenation point characters,
>> the hyphenation point is rendered as a regular hyphen at the end of
>> the line.
>
> Well, this sounds just like the required behaviour for Catalan, as
> described by Anto'nio Martins-Tuva'lkin on 28th March. He wrote:
>
>> Something happends when the "L·L" coincides with a soft line end. I'm
>> no expert in Catalan typesetting but IIRC the dot becomes a hyphen,
>> while regular "LL"s cannot be broken.

António is correct.
But this is not the main point of ·. The main point of · is to disambiguate
orthographies. Hyphenation behaviour is only a secondary role.

Besides, it is vastly easier to keep the obvious unification than to distort
it into a conditional mapping: if mathematics, · => U+00B7; if Catalan,
· => U+2027; if NoSeQue, · => some_other_random_middle_dot; etc. Unlike
hyphenation rules (where the mapping might very well be · => U+2027, by the
way), which are pretty easy to pinpoint, tagging Catalan in bulk text is
clearly not an easy task. Even when considering the fairly restrictive rules
for it to occur (requiring NFC):
/[aAàÀeEéÉèÈiIíÍïÏoOóÓòÒuUúÚ]l·l[aàeéèiíoóòuú]/
/[AÀEÉÈIÍÏOÓÒUÚ]L·L[AÀEÉÈIÍOÓÒUÚ]/
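
(A minimal sketch of how these two rules could be applied, in Python; an
illustration of mine, not code from any actual word processor:)

    import re

    # The two rules above, verbatim; the text is assumed to be in NFC.
    LL_LOWER = re.compile("[aAàÀeEéÉèÈiIíÍïÏoOóÓòÒuUúÚ]l·l[aàeéèiíoóòuú]")
    LL_UPPER = re.compile("[AÀEÉÈIÍÏOÓÒUÚ]L·L[AÀEÉÈIÍOÓÒUÚ]")

    def looks_like_catalan_ll(text):
        # True if the text contains an l·l in a plausible Catalan context.
        return bool(LL_LOWER.search(text) or LL_UPPER.search(text))

    print(looks_like_catalan_ll("el col·legi"))   # True
    print(looks_like_catalan_ll("3·4 = 12"))      # False: mathematical U+00B7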

Antoine




Re: U+0140

2004-04-16 Thread Antoine Leca
On Friday, April 16, 2004 3:26 PM, Ernest Cline va escriure:

> I don't see that as being any worse than the set of HYPHEN_MINUS,
> HYPHEN, MINUS SIGN, etc.

Sorry, I did not make myself clear. I am not intending to say this is
undoable, nor that the · case is particularly complex. It is doable (as I
showed with the regular expressions), and it is NOT complex.

I was just saying this is presently not done, and it is IMHO not worth
doing.


> Given the nature of U+0140 (and U+013F) when hyphenated, might it
> not be a good idea to assign these two characters their own Line
> Break class for  the Line Breaking Algorithm of UAX #14?

I do not know if it is a good idea or not (I am not the guy who can argue
this; furthermore these characters are very infrequent), but your
understanding of the behaviour is correct.


Antoine




Re: U+0140

2004-04-16 Thread Antoine Leca
On Friday, April 16, 2004 12:37 PM, Philippe Verdy va escriure:

> In some future, we could see U+013F and U+0140 used more often than L
> or l plus U+00B7...

I (personally) hope we would not.

> Notably in word processors that can detect these
> sequences in Catalan text and substitute them with the ligatures,
> which create a more acceptable letter form and allows easier text
> handling for (e.g.) word selection in user interfaces and dictionnary
> lookups.

As I wrote earlier, if you know the text under inspection is Catalan, a very
simple regular expression will deal with that; a sketch follows. Any
half-decent Catalan word processor does it already, by the way.
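
(Again an illustration of mine, under the assumption that the substitution
Philippe describes is wanted at all -- which, as said above, I hope it is
not:)

    import re

    def to_ligatures(text):
        # Map the spelled-out sequences to the legacy ligature
        # characters: l + U+00B7 before l -> U+0140, same for capitals.
        text = re.sub("l\u00B7(?=l)", "\u0140", text)
        text = re.sub("L\u00B7(?=L)", "\u013F", text)
        return text

    print(to_ligatures("col·legi"))   # "co\u0140legi"

The reverse mapping, for word selection or dictionary lookup, is just as
easy.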


> The fact that there's no such L-middle-dot on keyboards should not be
> a limit: word processors have more key bindings and more intelligence
> than the default keys found on keyboards.

Yes yes yes. Particularly when I want to insert a · afterwards between two
l's, when it appears I missed it on the first shot (yes, it happens). Or
when I want to remove a superfluous one that I typed by mistake (yes, it
happens too). With your "intelligence", this latter point will prove to be a
headache: on the first shot, a normal user will place the caret just after
the dot, and press Rubout. Slurp, the whole U+0140 is swallowed, but usually
the user will not notice it. So at second sight (perhaps long after, perhaps
after a useless additional printout), she will have to type the first l in
again.

Intelligent keyboards might be great. But to be so, they have to bring
*much* added value (like, obviously, making it possible to type a language
impossible otherwise; or, more simply, avoiding having to type Alt+0156
every five minutes). If they bring only very little value, they are more
annoying than anything else, particularly when they are not permanent but
rather operate from time to time. This would be the case here: as a Catalan
writer, I sometimes type texts in the word processor, where I would be
"helped". And sometimes in the mail reader, or on the console, where I would
not, for example because I do not want to wait two full minutes for the
whole "helpers" to come in every time I have to type the name of the user of
a given process...


> When I see a Catalan word coded with  it looks very
> ugly (notably with monospaced fonts or in Teletext) and I'm sure that
> Catalan readers don't like the default presentation.

Yes, it looks ugly. But it is in fact less ugly to me than seeing l.l or
l-l. Ugliness is in the eye of the beholder, of course. When you are in the
habit of seeing some rendering of l·l about every hour, you do not notice
it. And in fact, I notice it more when someone uses the kerned version
advocated by Gabriel Valiente, because nowadays it is unusual. And I
certainly would not use the kerned version for some institutional version,
because I do not want to inconvenience my readers (this problem showed up
about 20 days ago between us; and there was no debate).


> They will much
> appreciate the support for the ligated 
> encodings.

What do you prefer?

  El col·legi Miguel Hernández de Riola?

  El co[]legi Miguel Hernández de Riola?

([] is ASCII art for a box, which is how many many people would see any use
of U+013F...)


> I don't think they can be considered "compatibility
> characters" just introduced for compatibility with a past ISO
> standard for Videotex and Telelext.

Sorry, you are fighting a lost battle: nobody here uses them, so the whole
corpus is already encoded without them.
The windmills of Don Quixote are in Mota del Cuervo, only about 200 km from
here, but that is not the Catalan-speaking region ;-).


> The only safe way to change things would then be to have a middle-dot
> diacritic (combining but with combining class 0) to be used instead
> of U+00B7, even if there's no canonical equivalence with the U+013F
> and U+0140 ligatures... A Catalan keyboard would then return this new
> dot instead of U+00B7, and word processors or input method editors
> would easily find a way to represent it using the ligature when it
> follows a L.
[snip]

May I suggest U+1000B7 for this new character?


Antoine




Re: U+0140

2004-04-20 Thread Antoine Leca
On Saturday, April 17, 2004 10:28 PM TU+1, António Martins-Tuválkin wrote:
>> As I wrote earlier, if you know the text under inspection is
>> Catalan, a very simple regular expression will deal with that. Any
>> half-decent Catalan word processor do it already, by the way.
>
> What about the odd Catalan phrase within a text in Guarani or
> Cherokee?

Then you do not know that the text under inspection is Catalan, the "if" is
not asserted, so you are not supposed to act accordingly. That is, nobody
will hold it against you if a double click on col·legi does not select the
whole word; and any reader can test his own word processor: please
double-click the Catalan word above, and test whether it is recognized as
such, even when surrounded by bad English instead of Guarani!

> Unicode, do not forget, supposedly brings correctness to
> multilingual text...

And then?
Would you try to say that selecting a word in multilingual text should
always do the "right thing"? You are merely dreaming, I believe; and you
know it perfectly well, having posted less than 2 minutes ago the case of
apostrophes, which is about impossible to sort out in the average
multilingual text. Furthermore, what "the right thing" is varies from person
to person, so achieving perfection here is a mere dream.

Or are you trying to make the point that inventing a new codepoint for · in
Catalan would bring any added correctness to multilingual texts?


It is certain that the compatibility encoding of U+0140 is not very welcome
in my eyes, since:
 - it is almost unused, but for the case it might be, informaticians like me
do have to check for it: so it is just a waste of my time, I would say :-(
 - one who reads TUS and does not know Spanish usage in this respect might
think that col·legi should be written coŀlegi, "co\u0140legi", because the
former is not listed as a letter, and only the latter references itself as
"Catalan", without mentioning the "right thing to do"
 - the only advantage I am able to see, namely that typographers will design
the mid dot raised in U+0140 relative to the position it has in U+00B7, is
not exploited in practice; we even see a lot of fonts where the dot in
U+0140 is not balanced between the l's, which clearly shows that the
majority of typographers have no idea about the use of this character, and
they probably merely build it as a compound of U+006C and U+00B7... Others
use a reduced size for the dot in U+0140 (which is unpleasing to my eyes).
Only a few fonts provide U+0140 with a reduced width for the dot, which
might be considered good typography.

A further note about typography: I have compared, in some (widely available)
fonts, the layout of ŀl versus l·l and also the upper dot of the colon. I
found that almost nobody uses the upper dot of the colon. One of the few I
found, namely Linotype Palatino (I cite it since I generally consider it a
nice design), does use the upper dot of the colon for ŀ. And the result is
really ugly, because the dot is way too high (about 65% of l-height), thanks
to the modern habit of higher x-heights...


Antoine





Re: [OT] Even viruses are now i18n!

2004-04-22 Thread Antoine Leca
On Thursday, April 22, 2004 7:14 PM
Peter Kirk <[EMAIL PROTECTED]> va escriure:

> The virus writers have presumably confused
> .tc and .tk

.TR for Turkey. .TK (Tokelau) is no more sensible.


Antoine




Re: Common Locale Data Repository Project

2004-04-23 Thread Antoine Leca
On Friday, April 23, 2004 7:02 AM
Peter Constable <[EMAIL PROTECTED]> va escriure:

>> due to the strong perception of OpenI18N.org as
>> opensource/Linux advocates, even though CLDR project is not
>> specifically bound to Linux.
>
> It is hard to look at OpenI18N.org's spec and not get the impression
> that all of that group's projects are not bound to some flavour of
> Unix.

While CLDR certainly originates _from_ the Linux community, it is not
_bound_ to it. That is, as far as I understand, it is the same data as ICU
uses, and to my knowledge, ICU also "runs" on Windows, which is in no way
"bound to [that] flavour of Unix."

Or are you saying that, inasmuch as some advocate that everything from
Microsoft is so evil that one should not even touch it, everything that
originates from Linux is not pure enough to run on other systems?  :-)


> The "Scope" clause for several sections are specifically
> expressed in terms of Unix-related implementations (e.g. having the
> scope for rendering requirements expressed as what is needed for X
> Window).

Where are these clauses?
By the way, X Window, while Unix-related, is not bound to it. For example, I
ran for years an X client on a Windows desktop OS, with the server running
on another non-Unix machine. In fact, we did that because the equivalent
technology from Microsoft was at the time, ahem, not very mature...


> And even if a section isn't scoped specifically in terms of a
> Unix-derived platform, it may specify requirements that are explicitly
> related to Unix implementations (e.g. that base libraries must support
> POSIX i18n environment variables).

Again, where is it said that CLDR requires any form of "base libraries",
much less one that supports POSIX variables?


Antoine




Re: [OT] Even viruses are now i18n!

2004-04-23 Thread Antoine Leca
On Friday, April 23, 2004 2:08 AM, Philippe Verdy va escriure:

> From: "Antoine Leca"
>> On Thursday, April 22, 2004 7:14 PM
>> Peter Kirk va escriure:
>>
>>> The virus writers have presumably confused
>>> .tc and .tk
>>
>> .TR for Turkey. .TK (Tokelau) is not more sensible
>
> Or is that [tk] for Turkmen (the language code in ISO 639-1)?
> Not to confuse with [tr], the ISO 639-1 code for the Turkish
> language...

The virus cannot have any knowledge of a language code. And much less of the
language used by its next victim...


> May be it's time to get into the new CLDR repository if you don't have one
> of the many copies of the ISO 3166 country/territory codes list, and of
> the ISO 639 language codes, and the ongoing ISO 3066 locale codes?

Thanks for the advice. When I finally succeed at resurrecting my 1995 disk
that holds a copy of Keld's repository, I will try to profit from it.

BTW, my own copy of 3166 is in fact derived from a rewrite of 4217, as found
in TDED, as used in EDI. It is a beautiful copy on green paper sheets. It is
dated 1986, IIRC. So it probably predates all these "new" repositories,
sorry Keld ;-).


> Never forget that language codes and country/territory codes are
> different...

We were speaking about ccTLDs. A different beast. Try to resolve ANYTHING.GB.
on a root server, or alternatively seek UK in ISO 3166, to understand what I
mean.


Antoine




Re: [OT] Even viruses are now i18n!

2004-04-23 Thread Antoine Leca
On Friday, April 23, 2004 3:05 PM, Marco Cimarosti va escriure:

> Antoine Leca wrote:
>> The virus cannot have any knowledge of a language code. And
>> much less of the language used by its next victim...
^
Oops: I forgot to repeat "code" here. Looks like it confused people.


> It sends e-mails to addresses stolen from the previous victim's
> address list, so it can analyze the top-level domain of these
> addresses (".it", ".fr", etc.).
>
> Although, strictly speaking, these
> domains normally correspond to *country* codes, they are a pretty
> good hint of the language of the next victim.

I know that (see the rest of my post). Spammers do too, it seems...

This is really a hint about the language, not knowledge of its code.
I was corrected on the point that .TK did not stand for a TLD (my point),
but rather was a language code, Turkmen, which happens to be related to the
base language.

Let's drop Turkish for a while and switch to Swedish, where I feel things
are clearer. The ccTLD for Sweden is ".SE" (which is the value engraved in
the virus); the language code is "sv" on Unix, 0x01D on Windows, or 5 on
MacOS. My claim is that the virus does not have knowledge, at any point, of
the latter codes. Nor does it need them.
As you describe, it uses the former, deduces that the language of the
elected victim is likely to be Swedish, and composes the body of the mail in
Swedish. No need to use "sv" here.


Sorry if it was unclear at first.


Antoine




Re: TR35

2004-05-12 Thread Antoine Leca
On Tuesday, May 11, 2004 6:59 PM, Philippe Verdy va escriure:

> From: "Carl W. Brown" <[EMAIL PROTECTED]>
>> Expats break the locale model anyway.  The problem is that we use
>> country as both a language modifier and a location.
>
> From past comments I read here, it is understood now that locale
> identifiers used to select languages contain a country/territory code
> only as a legacy way to select language variants.

I disagree. You are seeing locale identifiers only in the context of
language tagging. That is not their primary use, nor the historical one, nor
the most prominent one.

The main usage for locale ids nowadays is to sum up all the i18n settings in
an environment. And certainly i18n settings depend on the language, but also
on the territory you are in. When you cross the border between Italy and
Slovenia, or between Ontario and New York, the most striking difference is
not the orthography or the accent, but rather the coins.

Then, the main variations within a language have historically been
identified with countries. This might be related to the common practice of
States affirming their independence by drawing up laws in this respect. It
might also be related to the current state of orthographies on the two sides
of the Atlantic Ocean for some important languages (and even more so when we
consider the situation 20 years ago.)

Whether this perception is correct as a "first tie", or whether it should be
replaced by another (which one?), I cannot say. What is certain is that it
is not universal.

Now, the two points (locale identifiers characterize language and territory,
and languages are usually partitioned by territory information) did
interfere during the last decade (certainly RFC 1766 and 3066 might be
related to this process.) Carl's point, and I believe he is correct, is just
that these two meanings should NOT be mixed. And that when we speak about
locales, the relevant one is the first (the part you snipped.)

> This code is meant
> to designate the language variant as spoken in that area, but not for
> identifying a location.

I am very sorry, but if in

LANG=fr; LC_MONETARY=es_ES

you consider that the _ES above marks a language variant (Castilian Spanish
as distinct from Hispano-American), you are deeply wrong.


> However the set of variables in POSIX is not rich enough or tweaked,
> because a single LC_ALL variable can override all these variables.

You are completely distorting the model here.
The normal setting is as above: LANG, then LC_xxx where LANG is inadequate.
LC_ALL is an alternative mechanism that adds a _supplementary_ level. This
is very useful when you have to temporarily override the settings (please
remember that POSIX is initially console-oriented), because this way you
can, with few keystrokes, specify the desired behaviour for a given action,
like in

LC_ALL=POSIX cc myStrangeProgram.c
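
(The resulting precedence, for any one category, is simple enough to
sketch; this is my reading of POSIX, not a quotation from it:)

    import os

    def resolve_category(category):
        # POSIX resolution for one locale category, e.g. "LC_MONETARY":
        # LC_ALL overrides the category variable, which overrides LANG;
        # the implementation default ("C", alias "POSIX") applies last.
        # (Empty values count as unset, which the "or" chain handles.)
        return (os.environ.get("LC_ALL")
                or os.environ.get(category)
                or os.environ.get("LANG")
                or "C")

    # With LANG=fr and LC_MONETARY=es_ES, as in the example above:
    os.environ.update({"LANG": "fr", "LC_MONETARY": "es_ES"})
    os.environ.pop("LC_ALL", None)
    print(resolve_category("LC_MESSAGES"))   # fr
    print(resolve_category("LC_MONETARY"))   # es_ES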


> This means that all settings what can be defined in a locale must be
> definable with the same identifier.

No, it does not _mean_ that. No obligation here.
Anyway, the general way to implement the standard C setlocale() is just
that, an identifier (not even human-readable, that is not its point) that
groups all settings.

If a Taiwanese user sets in .profile

LC_ALL=zh_TW; export LC_ALL

and then complains that the locale model is wrong, everybody, you included,
will tell her that what is primarily wrong is her setting.


> Now a good question is: can all settings in locales be selective
> enough to allow specifying correctly the possible values.

Define "possible": are you writing about the set of already described
locales? (the only useful, as Carl wrote, en_GU is essentially non-existent;
same for 0x180c)
Or about all the potential possible values, including pro_QQ for Occitan as
used within the Chancellery of Toulouse?


> Is the POSIX syntax enough for them?

Since an extension to it exists in ISO/IEC TR 14652, the answer here is
probably no.


Antoine





Re: TR35

2004-05-13 Thread Antoine Leca
On Wednesday, May 12, 2004 8:00 PM, Peter Constable va escriure:
> It's not particularly useful to communicate that a document was
> created when a locale with such-and-such number format was in effect,

Sure?

: Please send to us 100.000 units of your item 12010, available to our
: warehouse by 6/7/04. We agree with the current tariff.

Now it happens that I do NOT have such an item 12010, only 12001 or 21001.
And with the former, 10 may make sense, and 100 definitely does not. But
with the latter, 100 makes sense, 10 is probably too much (and anyway I do
not have that much merchandise available.) Units may be kg or t, in fact, so
3 decimals are adequate. What should I send? When?

Of course, the guy is away from office, cellphone is down, etc.
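
(To make the ambiguity concrete, a small sketch; whether the locales named
here are installed on a given machine is of course another matter:)

    import locale

    AMOUNT = "100.000"

    # The same string, two readings, depending on the formatting locale:
    for loc in ("en_US.UTF-8", "de_DE.UTF-8"):
        try:
            locale.setlocale(locale.LC_NUMERIC, loc)
        except locale.Error:
            continue  # locale not installed on this machine
        print(loc, locale.atof(AMOUNT))
    # en_US: 100.0 (decimal point) -- de_DE: 100000.0 (thousands separator)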


Well, it is true that what I am really searching for is not *exactly* the
formatting locale, but rather a broader piece of information, which would be
the mindset of the writer. But if the document happens to carry the locale
it was formatted with, then I have a hint about its correct meaning.

I agree beforehand that the locale id would not be a certain answer, just a
hint. This might not be what you had in mind.


I have another example, but I cannot expose it here publicly; it is related
to some proprietary software. Let us just say that knowledge of the locale
under which the document was created/formatted was required knowledge in
order to render it correctly.


> because that only meant how automated processes would format numbers,
> the author can choose to do something else, and the document can even
> use multiple formats: 1,234.56 as well as 1.234,56 (and it's not hard
> to imagine how the two formats might have been automatically added to
> the document at different times). Moreover, you would never label a
> document for a number format in order to determine how
> automated-formatting of numbers should be done on the receiving
> system.

I do not know about Mark, but at least I did. Now with EDIFACT there are
agreements to avoid possible misunderstandings (so the tagging turns out to
be useless; in fact it is already done at a higher level), but that was not
always the case. And I did see, and even build, processes that deal with
similarly tagged data.

For a present-day example, think of an i15d standalone program that emits
checks. I would expect such a program to be set up with a given locale
(according to the nationality of the check to emit), then fed with the
correct data. Now, if the setting-up process is itself a generic one, it
will itself be fed with data labeled with the format to be used.


Of course, we are very far away from Unicode here, even further from the
plain text Ken asks us to stick to. Clearly, the locale ids here are
attributes, and have almost nothing to do with languages, so this might be
inappropriate for CLDR as well (this is obscure to me at the moment.)
That is just to say that while I agree with the fundamentals of your
distinction, I also believe that the fact that locales have been "reduced"
(historically for the needs of APIs) to locale ids did then allow these ids
to be used to tag documents. And while one may argue this is "bad", there is
also no way to stop people from doing so...


Antoine





Re: TR35

2004-05-14 Thread Antoine Leca
On Thursday, May 13th, 2004 16:40, Peter Constable wrote:
> Only that I don't think it's appropriate in general to tag
> documents (by which I don't mean an accounting spreadsheet or an
> order-entry record) for things like number formatting, and so such
> info should not be included in attributes like xml:lang.

I am sorry, I had misunderstood the whole discussion then.

To me, documents encompassed any style of writing (and more). For example, I
believed that writing was invented 6 millennia ago precisely for accounting
and trading, *not* with the Hammurabi codex or the Egyptian hymns. But it
appears I was wrong.

> If something is going on internal to proprietary software, then there
> are no rules.

I also missed that the difference between language ids and locale ids only
matters when they are used in public documents in published standardized
formats, and that private formats or any out-of-band tags, persistent or
not, are irrelevant here.

So please ignore my points.
Of course, when we consider only the legal texts where all months shall be
in full letters, all quantities spelled twice, once with numbers and once
with letters, and the timezone rules explicitly deferred to some authority,
you are very right. And then the example from Mark is just garbage, as many
people would see it (replace "garbage" with "unreadable" if you are not
happy with that word); so it is not a "document" any more, and this would be
discarded as well.


So I beg your pardon for having abused your time.


Antoine





Re: TR35

2004-05-14 Thread Antoine Leca
On Friday, May 14, 2004 3:30 PM, Peter Constable va escriure:

>> To me, documents encompassed any style of writings (and was
>> broader). For exemple, I believed that writing was invented 6
>> millenaries ago precisely for accounting and trading, *not* with the
>> Hamurabi codex or the Egyptian hymns. But it appears I was wrong.
>
> If you get a clay tablet with some type of inventory on it and encode
> it digitally, presumably there are names of things, and numbers,
> perhaps also dates. Let's suppose you encode the text into a digital
> document. You assign a metadata tag indicating that the "language"
> (linguistic variety and writing system) is such-and-such. How would
> it be useful to also assign metadata to indicate what the number
> format is?

I do not know, I was not thinking about that.
I wrote about an electronic document, sorry, file, I might receive
containing an order form, and you said documents did not encompass order
forms, as I read it. So my example is void. My error was that I was
considering "accounting spreadsheet or an order-entry record" as documents,
while you do not. And my mistake was based, I think, on a faulty
interpretation of the history of writing, as I wrote.

Now, the actual content of the clay tablets is irrelevant (I think).



>>> If something is going on internal to proprietary software, then
>>> there are no rules.
>>
>> I also missed that the difference between language ids and locale
>> ids only mattered when used in public documents in published
>> standardized formats, and that private formats or any out-of-band
>> tags, persistant or not, are irrelevant here.
>
> If something is internal to your process, who cares but you what is
> happening?

I am basically a user. My "processes" are procedures; the objects they deal
with are, among others, electronic documents, sorry, files, a number of them
in proprietary formats that I can (partially) decode. And these files do
include or refer to locale ids and language ids, sometimes named one for the
other, BTW.
My process is very different from yours. And what you see as "internal to
your process" is, to me, actually usable, external data. See my example,
imagining it is a text processing file: deep inside, I have found the locale
id of the sender. Which was a hint, not the real data I would have liked.

To be able to get my job done, I sometimes (often, in fact) have to use
different pieces of software. I understood CLDR as being a way to establish
common ground for these programs to interoperate, the same way the ONLY
purpose of Unicode is to allow various programs to interoperate. And it
happens that these data (locale and language ids), hidden inside the
proprietary formats of the files, are what will select the data to be used.
Since I understand that, I feel committed to participating in the debate.
Now, one can just dismiss me, saying that I am not supposed to look at that,
that users should restrict themselves to the next release of XML. This is
equivalent to saying users are not invited to the discussions about the
tools they will use, a very common behaviour of computer people here in
Europe, and a behaviour I am very angry about (hence the sarcasm, for which
I would apologize).


Have a nice weekend, folks. (I wrote that because I noticed Saturday is a
raging day for this list ;-) while I am disconnected from the Internet, and
much quieter this way. There is no sarcasm, it's sincere.)

Antoine





Re: lowercased Unicode language tags ? (was:ISO 15924)

2004-05-03 Thread Antoine Leca
On Monday, May 03, 2004 1:52 PM,  John Cowan va escriure:

> Antoine Leca scripsit:
>
>> Particularly when I read
>>Tags constructed wholly from the codes that are assigned
>>interpretations by this chapter do not need to be registered with
>>IANA before use.
>> inside clause 2, which otherwise says that the 2nd subtag when 2
>> letter designates a country, and also says that 3rd and next subtags
>> do not have semantical restrictions.
>
> All tags need to be registered in the RFC 3066 regime, except those of
> the following forms:  xx, xxx, xx-yy, xxx-yy.

I know what the intent is. But I cannot find the words for this.
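
(For reference, the four exempt shapes reduce to a one-line test; a sketch
of mine, not wording from the RFC:)

    import re

    # Tags exempt from IANA registration under the reading quoted above:
    # a 2- or 3-letter primary subtag, optionally + a 2-letter country.
    EXEMPT = re.compile(r"^[A-Za-z]{2,3}(?:-[A-Za-z]{2})?$")

    for tag in ("ca", "ca-ES", "ca-valencia"):
        print(tag, bool(EXEMPT.match(tag)))   # ca-valencia -> False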


I remember having considered (back in the days of RFC 1766) that one could
use something like "ca-valencia", without any absolute need to register
it --since that is impossible; and the very nature of the RFC 1766 scheme,
with its 8 levels, appeared to me a real invitation to do things this way.
It appears on second read that things are a bit harder in this area with RFC
3066, but the words are not clear enough to dispel any use of similar
schemes; particularly since there is no other way to do so.


Anyway, this is of no importance: this is not topical here, RFC 3066 will be
promptly obsoleted anyway, and the "Philips" draft does allow a way to do it
"cleanly" with the embedded "-x-".

Sorry everybody for the noise.


Antoine




Re: Variation selectors and vowel marks

2004-04-30 Thread Antoine Leca
On Thursday, April 29, 2004 2:17 PM, C J Fynn va escriure:

> In font lookups, where a variant glyph form of a base character is
> displayed due to the presence of a VS character, the lookups for
> glyph forms of subsequent dependant vowel marks  will  be dependant
> on the variant base glyph (as long as the base glyph substitution has
> been applied first).

Then there is an easy "solution": to make the VS effect on the mark be of
higher "priority" (that is, occurring before) than any substitution that
occurs between the base and the mark.

We already do that with Indic scripts, and yes, it is already a nightmare.


Antoine




Re: ISO 15924

2004-05-03 Thread Antoine Leca
[ This is not copied to unicore, since I am not allowed there. This is
copied to ietf-languages because the question was, but it may perfectly well
be filtered out. ]


On Sunday, May 02, 2004 10:57 PM, John Hudson va escriure:

> In the code lists at
> http://www.unicode.org/iso15924/iso15924-codes.html the 4-letter
> script codes are shown capitalised, e.g. Arab not arab, Armn not
> armn, etc.. Is this intentional? Should the codes always be
> capitalised? Does it matter if they are not?

John,

I remember having a discussion about this some 4 years ago, regarding an
item of conflict between ISO 15924 (then pretty advanced) and a new code
list used for a similar purpose in Microsoft's and Adobe's proprietary
"OpenType". I am not 100% sure, but I even recall you might have been
instrumental in the design of this second list.

If I remember correctly, a good part of these lists was merged, which is
certainly a good thing, since we do not have any need for two concurrent
lists. In fact, I believe the intent on both parts was to merge. On its
part, ISO 15924 did change the codes it had for everything that was in use,
even including the Indian OpenType support which was still in its infancy
but was shipping as part of IE5.

But then there was the point about capitalization. Following previous use in
Apple resources and then TrueType, Microsoft designed its codes in all lower
case (which with Apple was reserved for the non-private specifications).
OTOH, Michael designed ISO 15924 to fit well with both ISO 639 and ISO 3166,
and chose title case. The discrepancy was seen at the time, but it was
agreed to do nothing: ISO 15924 would continue to use title case while
OpenType would use the same codes transformed to lower case. And we all
agreed that any discrepancy in capitalization should not be meaningful.
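
(In other words, any comparison of script codes ought to normalize the case
first; a trivial sketch:)

    def normalize_script_code(code):
        # Fold an ISO 15924 / OpenType script code to the ISO title-case
        # form, so that "arab", "ARAB" and "Arab" all compare equal.
        return code[:1].upper() + code[1:].lower()

    assert normalize_script_code("arab") == "Arab"   # OpenType style
    assert normalize_script_code("Armn") == "Armn"   # ISO 15924 style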


Why is this issue coming back now?


Antoine




Re: lowercased Unicode language tags ? (was:ISO 15924)

2004-05-03 Thread Antoine Leca
On Monday, May 03, 2004 4:36 AM
John Cowan <[EMAIL PROTECTED]> va escriure:

> Philippe Verdy scripsit:
>
>> And there are also ISO 3166-2 codes for administrative regions in
>> countries (such as FR2B for the department of Haute-Corse in France).
>
> I think those are usually written FR-2B, though I do not have access
> to 3166-2 itself.

You think right. See p.23 of
http://www.iso.ch/iso/en/prods-services/iso3166ma/03updates-on-iso-3166/nli-2.pdf
(free access).


>> Languages need not only distinctions by countries but also by regions
>> in countries, if this is needed.  So Catalan in the Spanish Canaries

About as useful as ro-FR, and less probable...


>> would use the ISO3166 code "ESCI" after the language tag "es"

Please use "ca". I know that Catalan is as much "Spanish" as Castilian is.
But the guys that designed ISO 639 decided otherwise (and this is probably
more mnemonic for many people, many of them native speaker of "es"). So it
had to be this way.

>> (the complete code would be "es-Latn-ESCI"

This is completely frivolous.

>> or just "es-ESCI", distinct
>> from "es-Latn" which could be used also for Castillan.

Spelled Castilian in English.
It looks like Philippe is mixing up the Balearics with the Canaries here...


> Catalan is not Spanish, and has its own code.

Sorry to contradict you slightly, John. Please note that this issue is
sensitive for some Catalans here in Spain, so I mention it for the sake of
everybody here knowing it.

> RFC 3066 permits registration of sub-country codes if needed,

Sure.

> but they must be registered explicitly to be used.

Where is that spelled out?

Particularly when I read
   Tags constructed wholly from the codes that are assigned
   interpretations by this chapter do not need to be registered with
   IANA before use.
inside clause 2, which otherwise says that the 2nd subtag, when 2 letters,
designates a country, and also says that the 3rd and subsequent subtags have
no semantic restrictions.


Doug Ewell also wrote:
> There isn't actually such a code as ES-CI (note the hyphen, which makes
> it distinguishable from a 4-letter script code).

Yes, there is. Spain (as well as other countries) has two levels of
administrative regions encoded in ISO 3166-2, recognising the dual structure
of its regional system, with at the same time 17 autonomous regions as well
as 50 (or 52) provinces. When the limits are the same, there is no
distinction in codes. When the autonomía is larger, as in the case of the
Canary Islands, which is made of 2 provinces, the designers of 3166-2
invented a 2-letter code from the English name of the autonomous region; it
means nothing and is not used here in Spain, but serves to designate it.

So yes, ES-CI does exist, and indeed designates the Canary Islands.



Antoine




Re: Romanian and Cyrillic

2004-04-28 Thread Antoine Leca
On Wednesday, April 28, 2004 5:28 PM
Peter Constable <[EMAIL PROTECTED]> va escriure:

>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
>> On Behalf Of Antoine Leca

Wow!


>> It is interessant to note that Microsoft did not endorse ISO 639 on
>> this regard, but sees Moldavian as being a form of Romanian, and
>> asks for use of "ro-mo" for the identifier corresponding to LangId
>> x0818.
>
> Yes, well, if you look at it carefully, you'll notice that the
> identifier "ro-mo" is effectively "Romanian - Moldavian" (mo is the
> ISO 639-1 ID for Moldavian; it's not the ISO 3166 ID for Moldova --
> if you interpret this as ll-CC, it would be Romanian spoken in Macao).

Sorry Peter, I do not buy this one. The page I have given lists all the
idents in lowercase, so there is no way to distinguish the fine print as you
do.

Furthermore, they are intended for use as RFC 1766/3066 tags, and in this
context, at least 3066 unambiguously says that the interpretation is of the
code of the country. As a result, reading -mo as "Moldavian" is much more
wrong than as "Macao".

So almost any careful reader will take it as a typo in the documentation
rather than as your reading.


> It's painful to see inconsistent info being put out. The reality is
> that Windows does not support 0x0818 (or 0x0819 for that matter).


What does "support" mean?
You probably mean that there are no strings for it in your LOCALE.NLS, or
some similar thing.

But I can perfectly well tag some text in Word as "Romanian from Moldova"
(or as Rhaeto-Romance, even if this code, 0x0417, seems to have disappeared
from MS's charts for a number of years, at least for Windows... Never mind,
I will keep it in my own chart }:->)

And I would be very surprised if KERNEL32 refused to load a resource that
happened to have this LCID.



> I consider there to be no real difference between "Romanian" and
> "Moldavian".

But others may opine differently...

It is true that Moldova does not have much money.

But it is also true, in the same way, that Russia does.

So, who knows...


> As far as Windows is concerned, I'd expect Windows might
> at some point support "Romanian (Moldova)" but I wouldn't expect
> "Moldavian".

If you mean Microsoft will resist, as much as they humanly can, associating
a new primary language identifier (like 0xe8) with Moldavian, I can probably
understand.

If you mean that a new language id (like 0xc18) could not be associated with
Moldavian, I would not bet on it.

If you mean you will never be forced to change the name for 0x818 from
"Romanian (Moldova)" to "Moldavian", I certainly would not bet on it.

And if you ever mean (I really do not think so) that you will never be
forced to grok "mo" as an RFC 3066 tag, then sorry, but I feel you really
should grok it (and for example look at updating
http://www.unicode.org/onlinedat/languages.html: Apple already groks it, or
so they say.)



Antoine




Re: Non-decimal positional digits; was: Defined Private Use

2004-04-28 Thread Antoine Leca
Also, before it was recognized that they are *also* used as decimal digits
(with some adequate substitute for the zero), the Tamil digits 1-9 were seen
as part of a non-decimal-positional system. Nevertheless, they were given
class Nd.

By the way, if the Tengwar system is only duodecimal (as I think), *all* its
digits should then be given class No, and none Nd.

Antoine




Re: Just if and where is the sense then?

2004-05-05 Thread Antoine Leca
On Wednesday, May 05, 2004 5:29 PM, John Jenkins va escriure:
> I should point out, however, that the probability of
> getting the pre-X versions of the Mac OS to support new 8-bit
> character sets is exactly 0.

Would the various Indian scripts not yet covered by ILK count as "new
character sets"?

Of course, this needs fonts, and I know they are currently lacking. But
assuming they existed...

The cost to produce appears low (a few resources), and the ROI might even be
positive. At least it seems so, looking from the outside.


Antoine




Re: TR35

2004-05-18 Thread Antoine Leca
On Friday, May 14, 2004 10:22 PM, Peter Constable wrote:
> It is simply inadequate analysis of usage scenarios to say "an
> order form contains formatted dates / numbers / currency that need to
> be interpreted, therefore this document has a locale".

Sorry, you lost me. I do not know what "usage scenarios" are. But if a
"usage scenario" describes a workflow, if the workflow involves orders, and
if the amounts can be written in an ambiguous form, I would have thought
that, _at some level of the modelling_, some notion of locale might be
present; and then that a realisation (I hope you get my specification
vocabulary right) might have a property "locale id" attached to the "order
form" document. This was the scheme I had in mind. Of course, it follows
that "this document has a locale" is a shorthand.

Nevertheless, I did not deny your analysis. Rather, I pointed out that in my
view, it would be wrong to think that "no document has a locale," which is
quite a different thing.

In case it was not clear before, I agree that in most cases, they do
not.

> But if the  record is *not* in a
> neutral representation, then there are several other questions that
> need to be considered regarding how the string was generated, and how
> the receiver knows what was assumed by the authoring process.

Regarding your example: I can very well envision an application that will
tag the , and also the XML document, with some externally defined locale
id (and I do not mean language here). And I have also already seen a pair of
applications doing similar things... Whether this is sensible or not is
another debate entirely: I just point out that it could be done.


>> And these files do
>> include or refer locale ids and language ids, sometimes named one
>> for the other BTW.
>
> Just because someone called the two the same doesn't mean that the
> notions are not distinct, and that it wouldn't be helpful for us to
> understand that distinction.

Again, I am lost: I did not say they are merged, just that some use the name
of the former to designate the latter. Now, I can accept they may in fact be
the same thing, since I am not an expert in this field: it is just that, to
me, they appear different for the moment (and the more I read in this
thread, the more I stay with my initial idea that they are different.)



>> And what you see as "internal to
>> your process" is, to me, actually an usable, external, data.
>
> If you consider it external, then it is because you expect others to
> use what you put there, or you are using what others put there -- and
> so it is indeed external.

Yes, exactly.


>> See my example,
>> imagining it is a text processing file: deeply inside, I have found
>> the locale id of the sender. Which was an hint, not the real data I
>> would have liked.
>
> If the document includes an ID that indicates the locale mode that was
> set in the author's software when the author created that file, and
> you wish to use that as a hint to set a processing mode on your end,
> I have no problem with that; I have never said anything against that.

This is what I missed.
I claimed that this ID was considered (by me) as a locale tagging of the
document (see my full reasoning above). I never claimed it was intended that
way at the beginning, or in other processes, including the ones that follow
the one recognising the intended meaning.

But in that particular process, it looked very much to me like a locale id
tagging a document.



> Rather, I'm saying that the conceptual model we have inherited from
> the past is inadequate, and that we need to adopt a more
> carefully-conceived model around which to design i18n platforms for
> the future.

This is starting to get interesting: we obviously will have quite a bit of
"backward compatibility" (in the minds of the people) to deal with, won't
we?

> And it starts by understanding that while they may be
> related, "locale" and "language" are conceptually two different
> things.

I never thought such a thing, did I?

OTOH, I acknowledged your terse description of the question as being a very
good thing (« ce qui se conçoit bien s'énonce clairement, et les mots pour
le dire viennent aisément » --what is well conceived is stated clearly, and
the words to say it come easily-- sorry M. de Boileau for the bad English
translation)


Antoine





Re: ISO-15924 script nodes and UAX#24 script IDs

2004-05-18 Thread Antoine Leca
Philippe Verdy wrote on Tuesday, May 18th, 2004 12:24:
> Also there are differences in orthographs in the table lists:
> the plain text version and Table 2 use consonnants with dot
> below for the english name, but Table 1 use basic Latin
> consonnants (example for Malalayam).

I believe these are typos that you ought to specify exhaustively to Michael
to have them corrected.

It looks to me as if all the diacritics were meant to be dropped in
English, and that a number of them escaped the net...


> Dots below are probably appropriate for the French name,
> not for the English one.

???

French usage has always been to "morph" the original name to suit French
orthographic rules.
OTOH, it appears to me (feel free to contradict me, and also to point me to
the epoch when these things changed) that the English habit now is to follow
the native name and the transliteration rules. A good example I found
recently is the name of Cervantes' main work, whose short name is "Don
Quixote" in English, the same as it was in (original) Castilian, while at
the same time it was adapted in French as "Don Quichotte" (same
pronunciation as the original), and similarly in today's Castilian "Don
Quijote" (with a subsequent change in pronunciation.) I do not know how
English natives pronounce it, however.

Another point is that the reference works about scripts in French are for a
good part old-fashioned, while at the same time recent English references
seem to abound. I may be biased here (I surely am, in fact), but it appears
to me to represent a certain evolution in the world use of languages in
scientific works over the last century...

As a result, when we built the tables for 15924, we chose to have the
French names represent the widely used practices, with the obvious
conventions (like ^ for the lengthened vowels) and some long-used ones, like
ç for ś in Indian scripts (but ch for ṣ, since it fits the need well.) But
we do avoid all the "strange" characters. The case of ḷ in Malayāḷam (or
Oṛiya) is exemplary: the sound does not exist in French, and about no
Frenchies will know how to say it correctly. Furthermore, I highly doubt
that the most immediate reaction of a literate French reader who sees
subscripted dots would be to imagine the retroflex feature this convention
implies... So we followed current practices and dropped all the subscripted
dots.

We do keep a number of strange spellings for the alternate variants (in
parentheses), particularly where usage was not fixed (I recall this
particularly about Cham.)



Antoine





Re: [OT] English pronunciation of Quixote (was: Re: ISO-15924 script nodes...)

2004-05-18 Thread Antoine Leca
On Tuesday, May 18, 2004 5:34 PM, Doug Ewell va escriure:

> Staying out of this thread probably won't help it go away, so...

;-)
The change of subject is adequate, anyway.


> This seems fair.  Even if there is a Spanish adjective "quixótico" --
> I found only one Google hit for it in Spanish, but many in Portuguese

Sounds fair, since the "x" meaning [ʃ] (the original sound) stayed in
Portuguese when it faded away in Spain (XVIIth c. as I understand it.) To
stay on the funny note, the French variant is « don quichottesque » or «
donquichottesque ».

I should thank John and Mark for the quick answer. By the way, when Mark
wrote "Donkey Hotay", it took me a minute to realize the meaning of his
post, and at first sight I was thinking about Sancho's mount!  8-|


Antoine







Re: ISO 15924 draft fixes

2004-05-20 Thread Antoine Leca
[Mailed _and_ posted to the list; UTF-8]

On Wednesday, May 19th, 2004 10:40 PM, Michael Everson wrote:

> I would appreciate it if interested persons could look this over and
> inform me if they find any further discrepancies between the two
> which are worth troubling about. Then we will proceed to generate the
> other files.

The French name for Hang looks strange. It happened to be "hangul (hangul,
hangeul)" (after quite a bit of discussion.)

Antoine





Re: ISO 15924 draft fixes

2004-05-20 Thread Antoine Leca
> Antoine Leca a écrit :
>
>> The French name for Hang looks strange. It happened to be "hangul
>> (hangul, hangeul)" (after quite a bit of discussion.)

Sorry guys. For reasons known to itself, my mailer refused to post in UTF-8
this morning. I meant "hangûl (hangŭl, hangeul)".

According to a native <ftp://dkuug.dk/ftp.anonymous/email/iso15924/277>, the
correct forms are the ones in parentheses (with an added apostrophe, as in
han'gŭl).

: From: "Jian YANG" [EMAIL PROTECTED]
: Subject: Re: Re: (iso15924.275) "Hangul (Hang~ul, Hangeul)"
:   as script name (~ is a diacritical mark)
: Date: Mon, 29 May 2000 15:49:25 -0400
:
:
: «Hangeul» = romanization standard of the South Korean Ministry
: of Education;
: «Hangul» = McCune-Reischauer romanization (the exact form is
: «Han'gŭl»: «u» with breve, not caron; but the diacritic was
: removed, no doubt to accommodate the ASCII convention);


On Thursday, May 20, 2004 3:51 PM, Patrick Andries va escriure:
>
> The name in ISO/CEI 10646 (F)  is « hangûl  » from a Corean dictionary
> and a Corean grammar published by the Inalco (Langues O').

Clearly, the Langues'O adapted it to French typographical possibilities,
turning the breve into a circumflex.

> Another
> suggested form in some sources, to approximate the pronunciation,
> is « hangueul »

This is the other form, with an added euphonic u after the g, to avoid a
complete mispronunciation.

As to whether all this is right or not, I do not know. But I believe this
text went through two ballots past the very people of Langues'O (?), so we
have no reason to correct now what was accepted in the standard. The only
choice right now is to type exactly what was printed, since I understand we
no longer have the master that served for the [F]DIS texts.

Since I am not a member of TC46, and furthermore I was away from the process
last year, I might very easily be wrong.


Antoine





Re: ISO 15924 draft fixes

2004-05-21 Thread Antoine Leca
On Thursday, May 20th, 2004 23:56, Philippe Verdy wrote:

> I see no real problem if not all the different orthographies are
> listed or if they are not used universally. As long as the name is
> non ambiguous. What will be important for interchange of data will
> not be this name but the Code (or N°, or even ID in UAX#24
> properties).

I disagree. When I put content on the web, under my signature, I care about
whether it is written correctly or not. And when there are different
possibilities, I prefer the best one given the other constraints (such as
technical limitations here or there.)


> So there's nothing wrong if "Han'gul" is shown to users

Sorry: this is meaningless to me as a French reader. And it is a mistake
(missing breve) when it comes to the McCune-Reischauer scheme. Half-good
fallback mechanisms are usually better than nothing, but worse than anything
else. And we do have better possibilities here.


> French normally has no caron and no breve, and the circumflex is used
> to mark a slight alteration of the vowel because of an assimilated
> consonnant in the historical orthograph (most often this circumflex
> in French denotes a lost "s" after the vowel).

Or it can be for other reasons. Which consonant is involved in "dû"?


> So the curcumflex on "Hangul" would be inappropriate for French,

Please go to Langues'O with this commentary. As I wrote, you will probably
be answered with the historical context.

Also, there are a number of circumflexes already in the names which have
nothing to do with a swallowed s (like in "dévanâgarî"), and which
furthermore are the main entries, unlike the case at hand. Are you proposing
to drop them? Perhaps in favour of macrons (as is done in a number of
dictionaries, by the way)?


> [Comments-OT]
> The problem of apostrophes is that French keyboards don't have
> it, but only have a single-quote.

Huh ???
It has been quite a while since I last used a French keyboard on NT/2000,
but until now, all of them sent apostrophes, not "single quotes".


Antoine





Re: is "n with tilde" used in French language ?

2004-07-05 Thread Antoine Leca
On Monday, July 05, 2004 1:52 PM
Anto'nio Martins-Tuva'lkin va escriure:
> From Spanish "cañón"? I'm sure there's an excellent reason to "keep"
> the tilde but trash the acute... ;-)

Yes: acute has a different meaning in French orthography (denotes a closed
vowel, and can occur twice) than it has in Spanish (denotes stress, and
normally occurs at most once).

On the other hand, dropping the tilde will:
 1) completely modify the pronunciation
 2) more importantly, completely change the meaning (canon is a completely
different word...)


Furthermore, perhaps cañon was borrowed before accents were common...
(and much later resurrected)


Antoine





Re: Importance of diacritics

2004-07-15 Thread Antoine Leca
On Wednesday, July 14th, 2004 20:12,
   Anto'nio Martins-Tuva'lkin va escriure:

>> Correct. Some people however would like to change that (i.e. so
>> that the dots are no longer optional).
>
> Not unlike the situation of French (and other) accent on capitals:
> Originally a technical constrain crudely solved to serve the tool
> instead of the user.

I do not hear proposals to *require* accentuating capitals in French (at
least in Europe), though. It is left to the domain of "good style
practices."

And by the way: the technical limitation also meant economy at many levels
(less leading so more lines in the same space, less breakage of lead type,
fewer keys on typewriter keyboards, the possibility to access the often-used
accented letters with a single key and so ease of typing, the possibility to
use a 7-bit character code compatible with existing phone lines), all things
in favour of the "user" when he is a writer; and which translate into
(lower) costs in favour of the reader, too.


Antoine




Re: Errors in TUS Figure 15.2?

2004-08-02 Thread Antoine Leca
On Friday, July 30th, 2004 19:47, Peter Kirk va escriure:
>>
>>> There appear to be two errors (not listed in the errata page
>>> http://www.unicode.org/errata/) in Figure 15.2 on page 391 of The
>>> Unicode Standard 4.0, the online version at
>>> http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf.

>>> The fourth column is supposed to indicate the desired rendering of
>>> . But in the text just before, ZWJ is specified as

Otto answered:
>> Read the paragraph immediately below that figure.
>
>
> OK. I did. But I shouldn't have to do that as this figure is supposed
> to be an example of what has been specified before.

Then have a look at Unicode 3.0.1
<http://www.unicode.org/reports/tr27/index.html#layout> and you will
understand what happened: initially it was the way you expected; but then (I
cannot spot exactly when, but it should be possible to find out), for
backward-compatibility considerations, this very behaviour (requesting
ligatures) was defeated for Arabic only. As a result, the table was updated,
and is now about useless. We really should provide examples from other
scripts (Khmer perhaps; and Sinhala, which appears to behave exactly this
way according to SLS 1134, the Ceylonese standard)


> And there is still a problem with the text before the figure.

Which text?

I was noticing a problem, but it is not what you are pointing out.
Page 390 has a section which describes the behaviour of ZWJ. This text is
where it is written that ZWJ requests a ligature (a useful addition here
would be to signal that Arabic on one side, and the scripts of India on
another, are exceptional in this respect). Then, if a ligature is not
available ("otherwise"), it explains the function of ZWJ to request a
cursive-connection form.

. Otherwise, if either of the characters could cursively connect
  but do not normally, ZWJ requests that each of the characters
  take a cursive-connection form where possible.

In a sequence like <X, ZWJ, Y>, where a cursive form exists for X
but not for Y, the presence of ZWJ requests a cursive form for X.

Up to there, I have no problem. But then I would have expected the obvious
reversed case, where a cursive form exists for Y but not for X (and the
function of ZWJ would be to request the/a cursive form for Y). However,
there is no such text... what is written is:

Otherwise, where neither a ligature nor a cursive connection is
available, the ZWJ has no effect.

I believe this is a formal defect that ought to be corrected. Particularly
when Sinhala uses this feature quite a lot (for rakaransaya and yansaya),
and also since it is described a few lines below, with the example from the
Persian standard!

Rest of the section reads:

In other words, given three broad categories below, ZWJ requests
that glyphs in the highest available category (for the given font)
be used:

1. Unconnected

2. Cursively connected

3. Ligated


Antoine




Re: Errors in TUS Figure 15.2?

2004-08-02 Thread Antoine Leca
On Monday, August 2nd, 2004 12:51, Peter Kirk va escriure:

> On 02/08/2004 09:25, Antoine Leca wrote:
>
>>> And there is still a problem with the text before the figure.
>>
>> Which text?
>
> As I wrote before,
>
>> There also seems to be an error in the text just before the figure
>> which states "In the Arabic examples, the characters on the left side
>> are in visual order already, but have not yet been shaped."

Ah sorry, I did not pay attention to this. And since Otto (correctly)
snipped it, it is a fact I did not notice until now...

Yes, you are right, this appears to be wrongly expressed. Moreover, the
"original" version in Unicode 3.0.1 (TR#27) already had it. I guess that it
was cut&pasted from another source, where there was an additional column,
at the left, which had some "visual order backing store" content...

Strange that such a wrong text survived that many reviews.


> I am looking at this in order to answer an argument that the new
> proposal which I and a group of others have submitted on Hebrew Holam
> (L2/04-307, http://www.qaya.org/academic/hebrew/Holam3.pdf) does not
> conform to the TUS defined use of ZWNJ. Well, it seems that this whole
> section of TUS is such a mess that it is hard to determine what use
> actually is defined.

I do not share this position.

I am revising this section (and more) while answering PR-37, which is about
ZWJ. Since I have now spent many hours on this, I have quite a good
understanding of the issues (even if I cannot say I have mastered this area.)

However, while I can agree with you about the area being fuzzy when it comes
to *ZWJ* and its numerous uses and some abuses (like Devanagari half-forms),
the verdict is nowhere near as bad for ZWNJ.
The behaviour of ZWNJ is consistent in about every place, and the correct
explanation is the one that appears, among other places, in chapter 15: that
ZWNJ restricts rendering to unconnected and unligatured forms (or prevents
the use of any connected form or ligature, if you prefer), where possible.


> Another argument against our proposal is that by defining
> ZWNJ as breaking a ligature I am specifying implementation.

This is a dubious argument. Unicode specifies encodings. When two different
"meanings" are identified, different encodings are required, so it is a
task for Unicode.

OTOH, if there is no underlying difference and the matter is purely one of
presentation (like the aspect of a, whether as a reversed e or as an o with
a left stem), then Unicode is not to be involved.

I know the border is fuzzy. ;-) or :-(.

Here, whether the fact that it ligates or not does mean something (and this
is the hard part of the demonstration) is what should be examined. How it is
implemented is largely irrelevant (in fact, it is relevant when the result
is *not* implementable!)


OTOH, regarding your problem, I should point out that the Bengali precedent
is anything but an example to follow: it appears to me as an ad-hoc solution
built in a hurry, that happened to fit well with certain technical
implementations; it is a nightmare to handle for others; and now there is on
the table a proposal, PR-37, which among other things will (try to) remove
this hack and replace it with another, more orthogonal one (using ZWJ).


Antoine




Re: [mo/mol] and [ro/ron/rum]

2004-08-16 Thread Antoine Leca
Peter Constable wrote:
>> I doubt it's necessary to worry about erasing the
>> political distinction between Romanian and Moldavian.
>
> OK. For managing language resources, what ID should one use?

Well, you do as usual: you do both, just to be sure.

What is the problem?
You already have a plethora of IDs, many of them overloaded; I am currently
in Spain, and I encounter too many messages which say (a bit paraphrased):
"you are running a session in Spanish; the current application does not
support Spanish; would you like to use Spanish instead?" (of course, this is
created by the difference between traditional and modern sort; still it is
funny). Before that, I was amazed by the very existence of the locale
"French for Monaco"... I am sure nobody will switch to this locale, since it
will probably mean the disappearance of a good number of French messages,
particularly in the "i15d apps" (only enabled for Belgium, Canada, France,
and Switzerland)!

Do not tell me you worry about the size of the resulting package: by the
time you get to handling Romanian for Moldova specifically, you already have
(based on GNP) about a hundred locales... Even if it is something very
specialized (say, a pan-Romance thesaurus), this will mean just one
additional locale in an already long list. And there is no additional work
to handle the fact that there are two locales, since they are identical, so
testing one or the other should give identical results. What is really
additional work, and it is work for you (the project manager), is to ensure
that the two locales do not diverge. This may imply prominent comments in
the sources (but you have to make sure these do not show outside) or in the
standards, or special coherency rules while building.

Furthermore, you (and your company) do not want to play politics, do you?
And entering the debate (either way, i.e. dropping Moldavian or dropping
ro_MD) always gives some heated extremists a way to bash your company;
offering both options is much more neutral (OK, doing nothing is better in
this respect; but it does not satisfy :-( ).
Note that this does not apply only to Microsoft (of course, MS is much more
of a target of this kind).

By the way, the users may not be a good source of information on such a
question. Basic users, the ones whose advice you really want, have no
preoccupation with IDs: once they have Romanian text and UI, they are very
happy; if that means switching the country to Rumania, well, they will do
that; if there is an additional entry for Moldavian and/or Moldova, and the
result is the same, well, that is fine too (and note that a number of them
will not even notice); what is not good is when neither the entry for
Moldavian, nor the entry for Romanian in Moldova, shows the same level of
support as Romanian in Rumania: in such a case, they switch back in a hurry
to Rumania (and yes, I have experienced that, and more than once).
The ones who do keep an eye on the ID are computer people (like the one
from the gov' agency that set up, wrongly, the code on the home page ;-)),
who for their own use are using English! The typical "Do what I say, not
what I do" kind. Which is why I said this can easily drift into religious
wars.


As you have already noticed in this thread, the issue is debatable, and
everybody has an idea about it; so I believe the responsible behaviour for
the IT technicians is to stay neutral.


Hope this helps,

Antoine




Re: MSDN Article, Second Draft

2004-08-23 Thread Antoine Leca
Jungshik Shin écrivit:
>> Except in some UNIX operating systems and specialized applications
>> with specific needs,
>
>Note that ISO C 9x specifies that wchar_t be UTF-32/UCS-4 when
> __STDC_ISO_10646__ is defined.

This is of course very pedantic (I do not believe there are existing
implementations that do it), but to be exact, UCS-2 and a 16-bit encoding
may be used for wchar_t while __STDC_ISO_10646__ is #defined. The macro is
just required to have a value below 200112L (the date of the first version,
here the -2 part, of ISO/IEC 10646 that defines characters beyond the BMP,
the equivalent of TUS 3.0.1).
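
As an illustration, here is a minimal sketch (my own, using nothing beyond
standard C95/C99) of how a program can inspect these promises; the values
printed are of course implementation-specific:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
    #ifdef __STDC_ISO_10646__
        /* wchar_t values are UCS codepoints; the macro expands to the
           yyyymmL date of the 10646 edition the implementation follows. */
        printf("__STDC_ISO_10646__ = %ldL, sizeof(wchar_t) = %u\n",
               (long)__STDC_ISO_10646__, (unsigned)sizeof(wchar_t));
        if (__STDC_ISO_10646__ < 200112L)
            puts("pre-2001 edition: a 16-bit, BMP-only wchar_t would conform");
    #else
        puts("wchar_t is not guaranteed to hold UCS codepoints");
    #endif
        return 0;
    }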


Antoine




Re: Saudi-Arabian Copyright sign

2004-09-22 Thread Antoine Leca
On Tuesday, September 21st, 2004 18:50 Kenneth Whistler va escriure:
> 
> With this change in place, it seems to me that the case is
> quite clear *for* separate encoding of any circled Arabic
> letter used as a symbol. If the sequence <062D, 20DD> were
> used, instead, it would cursively join inappropriately with
> neighboring Arabic characters, unless surrounded by ZWNJ as
> well.

Then could/should we use the sequence <200C, 062D, 20DD, 200C>?


Antoine




Re: Named sequences, was: Saudi-Arabian Copyright sign

2004-09-22 Thread Antoine Leca
On Tuesday, September 21st, 2004 10:58 Peter Kirk va escriure:
>
> Is the intention of these named sequences to list all sequences which
> are commonly considered to be units, although not treated as such by
> Unicode?

By the way, this raises questions I did not see clearly spelled out:

Is the intention of these named sequences to list all sequences that are
commonly rendered as single glyphs? (of course, this question is of
particular importance with regard to the Indic scripts ;-))

Is the intention of these named sequences to list all sequences that may be
commonly understood as unitary when rendered, even if the font technology
often really builds them from basic pieces? Here, I am thinking about the
'barakhadi', those traditional presentations in tabular form of an (Indic)
syllabary; I believe any abugida will have a similar presentation. I am
also thinking about all these conjuncts where the second consonant has a
distinct yet systematic glyph: a number of them, particularly when they
involve -y, -r or sometimes -v, are not considered special; but there are
more tangential cases, like -n, -m, -l, in Nagari, Oriya, Malayalam, etc.

Of course, there may be a considerable debate about 'commonly'...
Particularly since there is the danger that having such a table may be used
by some people as a reason to refuse a combination that is NOT registered;
there is also the problem of cases where two or more different sequences
are supposed to mean the same thing.


Antoine




Re: internationalization assumption

2004-09-29 Thread Antoine Leca
On Tuesday, September 28th, 2004 03:22 "Tom" wrote:
>
> Let's say.  The test engineer ensures the functionality and validates
> the input and output on major Latin 1 languages, such as German,
> French, Spanish, Italian,

Just a side point: French cannot be fully addressed with Latin 1.
Of course, it is good enough for things like, say, order entry (since the
keyboards do not provide access to the "missing" letters), but if you care
about i18n, it is probably not good enough.

Also, it strikes me that the test engineer validates "major Latin 1
languages" yet misses the most used of them.

> If those products handle all languages as addressed above, could it
> be assumed that the entire character sets in whole latin 1

Probably not. Just about every language has its peculiarities (which is a
good reason to have validated platforms instead of trying to validate
products one by one). For example, in Catalan, we have special rules for
hyphenating "·" (which is a Latin 1 character, present on the keyboards); I
am sure there are other special rules for other languages, too; in fact, I
am sure that there is no product nor any platform that can claim to support
*all* such rules, so there are always limits.

Antoine




Re: internationalization assumption

2004-09-30 Thread Antoine Leca
Dear Philippe,

[ I write to the list, since there is no point sending two posts. Internet
is full enough of errant SMTP mails anyway. ]

On Wednesday, September 29, 2004 17:42, Philippe Verdy va escriure:

> From: "Antoine Leca"
>> Just a side point: French cannot be fully addressed with Latin 1.
>
> True, due to the missing (but rare) oe or OE ligature

Rare? "beef", "heart", "eye", "egg" are anything but rare words, methinks.
Even in French.

Or do you mean 'rare' as meaning 'strange'?

Also, it becomes more and more important to have the euro sign.

> Anyway, no French users actually complain of this omission:

Ah! So there are a lot of people, myself and some Canadians well known here
included, who do not qualify as "French users", according to your rules.

> in addition, French keyboards typically never include a
> key to enter these ligatures,

I mentioned this point in the part you snipped.
I can also add that the usual keyboards in France do not offer a way to
accent capital letters, so as a result the usage is now to leave the accents
out, with the corresponding misunderstandings in all-caps texts.


> The "ae" ligature is used in French, but not in the common language
> (I think it is used only in some technical juridic or religious
> terms, inherited from Latin, or in some medical and botanic jargon):
> I can't even remember of one French word that uses it;

This is off-topic, but a current orthography of "et cætera" in French uses
the æ ligature (which is a letter according to the Unicode names, and which
collates after z in Catalan, Croatian, Magyar, Romanian, Slovak,
Slovenian...)


> With those considerations, would a software that only supports the
> ISO-8859-1 character set be considered "not ready" for French usage?
> I think not, and even today most French texts are coded with this
> limited subset, without worrying about the absence of a rare
> ligature, whose absence is easily infered by readers.

"Easily"? Well well well. Judging by the difficulty of some young children
to make the distinction between c½ur and coexister, I do not buy the
argument.

Similarly, it was possible in 1973 to consider that ISO 646-FR (AFNOR NF Z
62010) was sufficient for French usage, with the insertion of a backspace
between the vowel and the ^. However, computers are now a bit more powerful;
perhaps we can do ourselves a favour and drop those legacy constraints (and
it is particularly important for the usual posters in this forum to avoid
giving wrong impressions/information to the newcomers to i18n, I believe).


Cordialement,

Antoine




Re: [indic] CLDR 1.2 Alpha now available

2004-10-01 Thread Antoine Leca
Hi Rick,

On Friday, October 1st, 2004 00:17, Rick McGowan va escriure:
>
> The Unicode Consortium is pleased to announce that the alpha version
> of the Common Locale Data Repository (CLDR) 1.2 is available for
> public review.

Can you please clarify what the intent is with regard to the entries
currently filed as "bugs"?

There is a quite long list of them (mine is #173; there are only 74 closed,
so I assume it is about the 100th; still, it is marked "target: 1.2"), and I
have no visibility into which things are known to be applied before the beta
stage, and which things are supposed to have been applied already, for which
we should now file additional bug reports because something was missed while
applying them.

As an example, my bug
(http://www.jtcsv.com/cgibin/locale-bugs/data?id=173) does not seem to have
been applied at all. That is fine with me; I understand it is somewhere in
the queue and will be examined in due time.
However, I noticed that something I addressed was in fact already addressed
in another bug, #43 (http://www.jtcsv.com/cgibin/locale-bugs/data?id=43);
and this one seems to have been applied, at least in part; but I cannot
decide whether the missing part is known to need further revision, planned
for the next weeks, or whether it would be better to register additional
comments to draw attention to these points.

I understand very well that your resources on this are pretty limited (or
so at least it seems), so I certainly do not want to add any unnecessary
burden.


Antoine




Sinhala conventions

2004-10-25 Thread Antoine Leca
Sorry if you receive this twice: I posted it in the Indic list (appropriate
AFAIK) but copied the general list since experts not reading the first might
help. Please answer only on the Indic list to avoid more duplicates; thanks
in advance.


Following a recent thread, I am trying to understand the minutes of the June
meeting. I read there

[99-C37] Consensus: The UTC recommends that "right-side" forms
of conjuncts in Sinhala be represented by a sequence of <ZWJ, virama,
consonant>. [L2/04-131]
L2/04-131 itself I am forbidden to get at
http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/04-131, but an
equivalent copy is publicly available at
http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2737.pdf (I guess it is the same
because the latter says explicitly "L2/04-131" ;-)). This is a committee
draft, released for public comments on 2004-04-15, of revision 2 of SLS 1134
(the encoding of Sinhala, a Sri Lankan standard).


I am very interested to learn about the "zwj,vir,cons" sequence, and not
only because I spent a few hours at the end of July analysing this very
sequence (in response to http://www.unicode.org/review/pr-37.pdf), while it
appears from the minutes that a few weeks earlier, a decision was taken in
the committee to bring this very sequence into general use, but for yet
another use...


What is really "interesting" (so I think) is that this sequence
(zwj,vir,cons, really 200D 0DCA) does not appear in the said document;
neither is the expression "right-side"... So what is happening here?

A bit of context is probably needed here, so I direct everyone to re-read
Michael's http://www.evertype.com/standards/si/iso10646-to-sls1134.html
(thanks Michael!), written in 1997 (so anything there should be taken with a
pinch of salt, particularly the use of the joiners), which describes, near
the end, the problems with conjuncts in Sinhala (script).


If I read correctly:

 -- the usual case, i.e. in the Sinhala language (Elu), is to use the
explicit virama (al-lakuna, 0DCA); it is BUD-DHO in Michael's example; it
does not need any joiner (<0DB6, 0DD4, 0DAF, 0DCA, 0DB0, 0DDC>);

 -- when a ligature conjunct, Brahmi style, is requested, ZWJ/200D is put
_after_ the virama; this also happens for rakaransaya (subjoined ra),
yansaya (post-base ya) and repaya (similar to Nagari's repha), common in
Sinhala; to stay with Michael's example, this one is BU-DDHO, and would be
encoded <0DB6, 0DD4, 0DAF, 0DCA, 200D, 0DB0, 0DDC>.

Up to there, I believe it is exactly what L2/04-131 / N2737 spells out
(particularly §§ 5.6 to 5.8).

If we study Michael's document, we can understand that the so-called Pali
"kerned" conjuncts, BU-[DDH]O, are not addressed.

So my educated guess (helped by documents recently made available in Sri
Lanka) is that the cons/200D/0DCA/cons sequence is used to encode these
"kerned" conjuncts or "touching letters". As a result it ought to be encoded
<0DB6, 0DD4, 0DAF, 200D, 0DCA, 0DB0, 0DDC>.
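
Spelled out as code units (a sketch of my reading of the three cases, to be
taken with the same caution as the guess above):

    #include <stdint.h>

    /* The three encodings discussed above, as 0-terminated UTF-16
       code unit arrays (my reading of L2/04-131, to be confirmed). */
    static const uint16_t buddho_split[] =    /* BUD-DHO: explicit al-lakuna */
        { 0x0DB6, 0x0DD4, 0x0DAF, 0x0DCA, 0x0DB0, 0x0DDC, 0 };
    static const uint16_t buddho_ligated[] =  /* BU-DDHO: ZWJ after the virama */
        { 0x0DB6, 0x0DD4, 0x0DAF, 0x0DCA, 0x200D, 0x0DB0, 0x0DDC, 0 };
    static const uint16_t buddho_kerned[] =   /* BU-[DDH]O: ZWJ before the virama */
        { 0x0DB6, 0x0DD4, 0x0DAF, 0x200D, 0x0DCA, 0x0DB0, 0x0DDC, 0 };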


Can someone confirm this?

Also, can someone confirm that what is described here is actually what will
put in SLS 1134 rev. 2? (or the best approximation of)


Antoine




Re: official languages of ISO / IEC (CIE)

2004-11-09 Thread Antoine Leca
On Tuesday, November 8th, 2004 23:13Z E. Keown va escriure:
>
> Does either the ISO or the IEC have official
> languages?

As far as I know, yes, three.

BTW, about U.N. I believe there are 6 "working languages."


>  Whether official or not, is French the
> 'second language' of the standards world?

You are not expecting us to feed this troll, are you?


> I'm about to translate something into technical
> French.I still didn't purchase a technical French
> dictionary because the ones I've seen didn't have
> enough computer terminology.

Anyway, if you want to do technical translations of computer matters into
French, you'll invariably fall into one (or both) of two traps: either using
far too many « anglicismes » (i.e. words borrowed from English while an
equivalent and perfectly valid French word does exist), or using official
neologisms that nobody uses in practice.

To make matters worse, the status of a word varies with your position on the
planet: i.e. some words are customary in France while others are in Quebec.
Etc.


The reality is that the language _spoken_ by the techies, at least in France
(or Spain) but also in Quebec I believe, is full of English words which
should not be used. Things are a bit /better/ (from my point of view; read
/different/ for a more neutral view) with written material.

And I do not say that because you are not a native: we all do that as well,
for translating technical material into French is usually very difficult
to do right, particularly for a non-specialist in the field such as myself.

Now I let Patrick comment on this one, I am sure he will add things ;-)))
Just keep in mind *he* is a professional.


Bonne chance pour votre traduction.

Antoine




Re: My Querry

2004-11-23 Thread Antoine Leca
Mike Ayers wrote:
> Addison Phillips Sent on Tuesday, November 23, 2004 9:14 AM
>> That is, amoung other things
>> UTF-8 was designed specifically to be compatible with C
>> language strings.
>
> Wrong!

What is wrong? That UTF-8 (born FSS-UTF) was designed to be compatible with
C language strings?
Of course it was. Even more, it had to be compatible with the '/' codepoint,
very important in Unix.

Another problem entirely is to determine whether it _succeeded_ at this aim.


> UTF-8 is fully compatible with ASCII,

I do not know what "fully compatible" means in such a context. For
example, ASCII as designed allowed (please note I did not write "was
designed to allow") the use of the 8th bit as a parity bit when transmitted
as octets on a telecommunication line; I doubt such a use is compatible with
UTF-8.


I do not object to your point about the impossibility of representing NUL
in C strings. But as you say, very often this is not an actual problem.
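
For what it is worth, the design property is easy to exhibit: in UTF-8 the
bytes 0x00 and 0x2F ('/') occur only as the encodings of U+0000 and U+002F,
and every byte of a multibyte sequence has its high bit set, so strlen(),
strchr() and the Unix path parser keep working. A small sketch of mine (not
from the FSS-UTF paper):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "café/menu" encoded in UTF-8: the é is the pair 0xC3 0xA9 */
        const char *s = "caf\xC3\xA9/menu";

        printf("%u bytes\n", (unsigned)strlen(s));               /* 10 */
        printf("'/' at offset %d\n", (int)(strchr(s, '/') - s)); /*  5 */
        return 0;
    }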


Antoine




Re: My Querry

2004-11-23 Thread Antoine Leca
Philippe Verdy écrivit:

> From: "Antoine Leca" <[EMAIL PROTECTED]>
>> For example, ASCII as designed allowed (please note I did not write
>> "was designed to allow") the use of the 8th bit as parity bit when
>> transmitted as octet on a telecommunication line; I doubt such use is
>> compatible with UTF-8.
>
> The parity bit is not data; it's a framing bit used for transport/link
> purpose only.

Did I say otherwise?
Even if it is not "data", you can store it inside an octet, along with 7
bits of /data/. You cannot do something similar if you have 8 bits of data;
it won't fit inside the octet. Which was my point.


> ASCII is 7 bit only, so even if a parity bit is added (parity bit can
> be added as well to 8-bit quantities...), it won't be part of the
> effective data, because once the transport unit is received and
> checked, it has to be cleared

Sorry, no: there is no requirement to clear it.
You are assuming something about the way the data are handled. When you
handle ASCII data using octets, you can perfectly well, and conformantly,
keep some other "data" (be it parity or whatever) inside the 8th bit; so
with even parity, AT SIGN will be managed as 192, without any kind of
problem (for you). It might even be very convenient to keep this bit as it
is, for example if you know you will have to forward it to another piece of
equipment along some communication line.

In fact, there were (at least a few years ago) some mail gateways that did
exactly that, and I found recently that this hack I used about 25 years ago
was not THAT good.
;-)
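
To make the 192 concrete, a little sketch (my own illustration, not
anybody's protocol) of even parity stored in the 8th bit:

    #include <stdio.h>

    /* Set the 8th bit so that the total number of 1 bits is even. */
    static unsigned char add_even_parity(unsigned char c7)
    {
        unsigned bits = c7 & 0x7F, ones = 0;
        for (int i = 0; i < 7; i++)
            ones += (bits >> i) & 1;
        return (unsigned char)(bits | ((ones & 1) ? 0x80 : 0));
    }

    int main(void)
    {
        /* '@' is 0x40 = 64, with one bit set, so the parity bit
           is raised and the octet becomes 0xC0 = 192. */
        printf("'@' (64) with even parity -> %u\n", add_even_parity('@'));
        return 0;
    }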


> By saying UTF-8 is fully compatible with ASCII, it says that any
> ASCII-only encoded file needs no reencoding of its bytes to make it
> UTF-8.

Looks like a good definition of upward (or backward, as you want)
compatible. I was tilting at "fully", particularly since the discussion was
picky about NUL wrt C.

What you are writing is that a 7-bit byte encoded in ASCII is "fully
compatible" with an 8-bit byte encoded in UTF-8... Looks strange to me
written that way, doesn't it?


> Note that this is only true for the US version of ASCII

Anything else would be whateverSCII, but definitely not ASCII, methinks...


> "ASCII" is normally designating only the last standard US variant

Funny. "Last"... You know of /several/ variants?
I do know of several variants of ISO/IEC 646, and even of several variants
of its /reference/ version. And then there is ISO/IEC 2375, and 4873. But
that is another story entirelly.

You were not saying that UTF-8 is fully compatible with *ISO/IEC 646*
instead, were you?


Antoine




Re: Another Querry

2004-11-24 Thread Antoine Leca
On Wednesday, November 24th, 2004 04:02Z Harshal Trivedi va escriure:

> How can i determine end of UCS-2/UCS-4  string while encoding it in C
> program?

It depends on how you are storing it and, more importantly, managing it.

If you consider it as mere arrays of uint16_t/uint32_t, with your own
functions to do any processing you want, you can use whatever way of knowing
'end of string' is convenient to you: either store a marker (one can think
of U+0000 for that), or record the size of the array, or even do both.
If you want to go this way and have no code written yet, you should really
have a look at ICU; basically it is/was a library that did exactly this
kind of thing for you.

On the other hand, if you want to take advantage of the resources the C
library offers, perhaps your platform already has some kind of UCS
encoding available (read section 5.2 of The Unicode Standard as a starting
point); then there is no real difference from plain strings: 0 is used
to flag the end of a string.
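
A minimal sketch of the two home-grown approaches (the names are mine,
purely for illustration):

    #include <stdint.h>
    #include <stddef.h>

    /* Approach 1: sentinel-terminated, like C strings but with
       uint16_t code units and U+0000 as the end marker. */
    static size_t ucs2_len(const uint16_t *s)
    {
        size_t n = 0;
        while (s[n] != 0)
            n++;
        return n;
    }

    /* Approach 2: carry the length explicitly (roughly what a
       Pascal-style string does); embedded U+0000 is then allowed. */
    struct ucs4_string {
        uint32_t *units;
        size_t    length;
    };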


Antoine




Re: Question on Canonical equivilance

2004-11-25 Thread Antoine Leca
On Wednesday, November 24th, 2004 16:26Z Tim Greenwood va escriure:

> All of the spacing combining marks (general category Mc) except
> musical symbols have a canonical combining class of 0.
> Why is this?

About the Indic vowel signs, I assume it is this way to avoid them being
reordered (in weird ways), particularly when multi-piece vowels are
involved.


> The Canonical
> Combining Class Values in UCD.html has entries and values for left
> attached and right attached - but no characters have these values.

They (the Indic vowel signs) happened to have a class >0 before v.2.1.8
(1998). I believe UCD.html still reflects this past state.

For example, the accompanying README tells us:

  Note that as of the 2.1.8 update of the Unicode Character Database,
  the decompositions in the UnicodeData.txt file can be used to recursively
  derive the full decomposition in canonical order, without the need
  to separately apply canonical reordering. However, canonical reordering
  of combining character sequences must still be applied in decomposition
  when normalizing source text which contains any combining marks.

I assume it has to do with the work on TR15, which you might consult
(http://www.unicode.org/reports/tr15/tr15-10.html) for enlightenment.


Antoine




Misuse of 8th bit [Was: My Querry]

2004-11-25 Thread Antoine Leca
On Wednesday, November 24th, 2004 22:16Z Asmus Freytag va escriure:
>
> I'm not seeing a lot in this thread that adds to the store of
> knowledge on this issue, but I see a number of statements that are
> easily misconstrued or misapplied, including the thoroughly
> discredited practice of storing information in the high
> bit, when piping seven-bit data through eight-bit pathways. The
> problem  with that approach, of course, is that the assumption
> that there were never going to be 8-bit data in these same pipes
> proved fatally wrong.

Since I was the person who introduced this theme into the thread, I feel
there is an important point that should be highlighted here. The "widely
discredited practice of storing information in the high bit" is in fact,
like the Y2K problem, a bad consequence of past practices. The only
difference is that we do not have a hard time limit to solve it.

The practice itself disappeared quite a long time ago (as I wrote, I myself
used it back in 1980, and perhaps also in 1984 in a Forth interpreter that
overused this "feature"), and right now nobody in his right mind would even
think of the idea
(OK, this is too strong; certainly one can show me examples of present-day
uses, probably more in the U.S.A. than elsewhere; just as I was able to
encounter projects /designed/ in 1998 with years stored as 2 digits, and
then collating dates on YYMMDD.)

However, what is a real problem right now is the still widespread idea that
this feature is still abundant, and that the data should be "*corrected*".
So one should use toascii() and similar mechanisms that take the /supposedly
corrupt/ input and make it "good compliant 8-bit US-ASCII", as some of the
answers that were made to me pointed out.

It should now be obvious that a program that *keeps* any parity information
received on a telecommunication line and passes it unmodified to the next
DTE is less of a problem with respect to possible UTF-8 data than the
equivalent program that unconditionally *removes* the 8th bit.

The crude reality is that the problem you refer to above really comes
from these castrating practices, NOT from the now-retired programs of the
'70s that, for economy, re-used the 8th bit to store other information
along the pipeline.
And I note that nobody in this thread advocated USING the 8th bit. However,
I saw remarks about possible PREVIOUS uses of it (and these remarks were
accompanied by the relevant "I remember" and "it reminds me" markers that
might show advice from experienced people toward newbies, rather than
easily misconstrued or misapplied statements).
On the other hand, I also saw references to practices of /discarding/ the
8th bit when one receives "USASCII" data (some might even be misconstrued to
make one believe it is normative to do so); and these latter references did
not come with the same "I remember" markers, quite the contrary; and present
practices of Internet mail will quickly show that these practices are still
in use.

In other words, I believe the practice of /storing/ data in the 8th bit is
effectively discredited. What we really need today is to ALSO discredit the
practice of /removing/ information from the 8th bit.
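
A two-line demonstration (my own) of why the stripping habit is the
dangerous one today: applied to UTF-8, it silently turns one character into
two different ones.

    #include <stdio.h>

    int main(void)
    {
        unsigned char s[] = "\xC3\xA9";   /* U+00E9 'é' in UTF-8 */
        for (int i = 0; s[i] != 0; i++)
            s[i] &= 0x7F;                 /* the toascii()-style "correction" */
        printf("%s\n", s);                /* prints "C)", i.e. 0x43 0x29: garbage */
        return 0;
    }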


Antoine




Re: Misuse of 8th bit [Was: My Querry]

2004-11-26 Thread Antoine Leca
On Thursday, November 25th, 2004 08:05Z Philippe Verdy va escriure:
>
> In ASCII, or in all other ISO 646 charsets, code positions are ALL in
> the range 0 to 127. Nothing is defined outside of this range, exactly
> like Unicode does not define or mandate anything for code points
> larger than 0x10, should they be stored or handled in memory with
> 21-, 24-, 32-, or 64-bit code units, more or less packed according to
> architecture or network framing constraints.
> So the question of whever an application can or cannot use the extra
> bits is left to the application, and this has no influence on the
> standard charset encoding or on the encoding of Unicode itself.

What you seem to miss here is that, given that computers are nowadays based
on 8-bit units, there was a strong move in the '80s and the '90s to
_reserve_ ALL 8 bits of the octet for characters. And what A. Freytag was
asking was precisely to avoid floating different ideas about possibilities
of encoding other classes of information inside the 8th bit of an
ASCII-based storage of a character.

In a similar vein, I cannot agree that it could be advisable to use the
22nd, 23rd, 32nd, 63rd, etc., the upper bits of the storage of a Unicode
codepoint. Right now, nobody sees any use for them as part of characters,
but history should have taught us to prevent this kind of optimisation from
occurring. Particularly when it is NOT defined by the standards: such a
situation leads everybody and his dog to find his own particular "optimum"
use for this "free space", and these optimums generally do not coincide
with one another...


Antoine




Re: Nicest UTF

2004-12-02 Thread Antoine Leca
On Wednesday, December 01, 2004 22:40Z Theodore H. Smith va escriure:

> Assuming you had no legacy code. And no "handy" libraries either,
> except for byte libraries in C (string.h, stdlib.h). Just a C++
> compiler, a "blank page" to draw on, and a requirement to do a lot of
> Unicode text processing.
<...>
> What would be the nicest UTF to use?

There are other factors that might influence your choice.
For example, the relative cost of using 16-bit entities: on a Pentium it is
cheap, on more modern x86 processors the price is a bit higher, and on some
RISC chips it is prohibitive (that is, short may become 32 bits; obviously,
in such a case, UTF-16 is not really a good choice). At the other extreme,
you have processors where bytes are 16 bits; obviously again, UTF-8 is then
not optimum there. ;-)

Also, it may matter whether you have write access to the sources of your
library: if yes, then it could be possible (at a minimal adaptation cost) to
use it to handle 16-bit or 32-bit characters. Even more interestingly, this
might already exist, in the form of the wcs*() functions of the C95 Standard.

It also depends, obviously, on the kind of processing you are doing. Some
programs mainly handle strings, so the transformation format is not the most
important thing. Yet others handle characters, and then UTF-8 is less
adequate because of the cost of locating code point boundaries. On the other
hand, texts are stored in external files, and if the external format is
UTF-8 or based on it, then that might be a bias toward it.

And finally, it may depend on how many different architectures you need to
deploy your programs on. C is great for its portability, yet portability is
a tool, not a necessary target. A single user usually does not care how
portable the program he is using is, provided it does the job and comes
cheap (or not too expensive). I agree portability is a good point for IT
managers (because it foments competition, which is good for cutting costs.)
But on the other hand, too much portability can be counter-productive for
everyone (for example, writing a text processor in C which allows characters
to be stored directly as 8-bit as well as UTF-16 bytes. Or using long for
everything, in order to be potentially portable to 16-bit ints, even if the
storage limitations will impede practical use.)


I believe the current availability of 3 competing formats is a fact that
we have to accept. It is certainly not as optimal as the prevalence of ASCII
may have been. It is certainly a bad thing for some suppliers, such as those
that are writing those libraries, because it means ×3 work for them and an
increased price for their users (be it in sales price or in delayed
availability of features/bug corrections/etc.) Moreover, the present
existence of widely available yet incompatible installed bases for at least
two of the formats (namely UTF-16 on Windows NT and UTF-8 on Internet
protocols) means additional costs for about all of the industry. This may
mean more workload for those that are actually working in this area ;-), but
also more pressure upon them from their managements, and results in waste
when seen from the client side, so not a good thing for marketing.
Yet it is this way, and I assume we cannot do much to cure that.

Now let's proceed to read the rest...


> I think UTF8 would be the nicest UTF.

So that is your point of view.


> But does UTF32 offer simpler better faster cleaner code?

Perhaps you can actually try to measure it.


> A Unicode "character" can be decomposed. Meaning that a character
> could still be a few variables of UTF32 code points! You'll still
> need to carry around "strings" of characters, instead of characters.

This syllogism assumes that any text handling requires decomposition. I
disagree with this.


> The fact that it is totally bloat worthy, isn't so great. Bloat
> mongers aren't your friend.

Again, do you care to offer us any figures?


> The fact that it is incompatible with existing byte code doesn't help.

See above.

> UTF8 can be used with the existing byte libraries just fine.

It depends on what you want to do. For example, using strchr()/strspn() and
the like may be great if you are dealing with some sort of tagged format
such as SGML; but if your text uses U+2028 as its end-of-line indicator, it
suddenly becomes not so great...
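
To illustrate with my own example: U+2028 LINE SEPARATOR is the three bytes
E2 80 A8 in UTF-8, so the per-byte functions no longer apply and you must
fall back on substring search:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *text = "line one\xE2\x80\xA8line two";

        /* strchr() can only look for a single byte; a multibyte
           delimiter needs strstr() with the whole sequence. */
        const char *eol = strstr(text, "\xE2\x80\xA8");
        if (eol != NULL)
            printf("first line is %d bytes long\n", (int)(eol - text)); /* 8 */
        return 0;
    }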

> An accented A in UTF-8, would be 3 bytes decomposed.

Or more.

> In UTF32, thats 8 bytes!

And so? Nobody is saying that UTF-32 is space-efficient. In fact, UTF-32
specifically trades space for other advantages. If you are space-tight,
then obviously UTF-32 is not a choice. That is another constraint. Which you
did not add to the list above.

On the other hand, nowadays, the general-use workstation used for text
processing has several hundred megabytes of memory. That is, several
scores of megabytes of UTF-32 characters, decomposed and so on.
The biggest text I have at hand is below 15 M. And when I have to deal with
it, I am quite cl

Re: current version of unicode font (Open Type) in e-mails

2004-12-03 Thread Antoine Leca
> Arial Unicode MS version 1.01 is most current and shipped with Office
> 2003. I called it OpenFont. Sorry! I double-clicked on its icon -
> whith a colored "OT" - in \WINDOWS\Fonts again it says after version
> 1.xx "(Opent Type)". I took that to mean Open Source or something
> more open than MS's restrictive policy about it.  I still claim it
> does not deserve the label OPEN at present.

Not everything labeled "Open" means open source, much less "free"; these
days, it is much more of a marketing gimmick. For example, here in Europe
(where it is not common), some stores are marked "open all night long." But
nobody expects them to offer free food ;-).
OpenType is a technology that requires the application to provide some
processing itself (instead of relying on the graphical engine, as with Type
1 or TrueType): so the font is "open" to the application.
And to avoid confusion: OpenType (well, TrueType Open as it was named then)
predates open source.


Also, I am not sure of the licensing conditions of Arial Unicode MS,
particularly v.1.01 (I did not license Office 2003, so I cannot easily
check.)
My understanding was that back with Office 2000 and 2002, once you licensed
Office on one computer or for one user, you owned the right to use the font
even with another operating system, say Linux. (On the other hand, I know
things are different for fonts like Latha, which come with the operating
system itself, and it seems prohibited to use them on another system.)


> Can they just be copied into \WINDOWS\Fonts as is the easiest
> 'installation' of ttf-fonts ?

Not really. The system will not notice it, at least until you reboot.
A trick that works well for me is to copy the font there, then to launch
Explorer on this very directory. This way Windows forces a re-enumeration of
the folder, and it will register your newly added font.
Alternatively, you can open the folder in Explorer and drop the font
inside it.


> Vice versa MS-fonts can be installed under Linux,

See above: you really should check the licensing conditions. And beware,
since they vary from font to font, or even from release to release (for
example, with Arial, you can of course install the one that comes with
"freecorefonts", but I understand you are not allowed to install the newer
release that comes with XP.)


Antoine




Re: current version of unicode-font

2004-12-03 Thread Antoine Leca
On Friday, December 03, 2004 13:10, Cristian Secară va escriure:
>
> However, the .ttf fonts that ship with their products are showing an
> OT icon. I don't know how it's done technically.

Technically, it is done by including a (valid) 'DSIG' (digital signature)
subtable in the font file, that is, a table whose only aim is to guarantee
that the font file is unaltered (using cryptographic seals like those used
for certified e-mails).

The interesting thing is that while the specification for this 'DSIG' table
is part of OpenType, it is completely unrelated to what people usually
associate with this technology, that is, the possibility of having
complex-script and advanced typography support (see my previous post for
details, since I made the mistake myself ;-).) Nor is it related to the
possibility (also introduced by the OpenType specifications) of having the
outlines and hints stored in PostScript format (rather than the traditional
TrueType format).

As a result, the nice-looking OT on a font is misleading; it just means the
designer has paid Verisign for a class 3 certificate and signed the font.
And last but not least, it ensures you the font has not been modified (I was
hoping Windows actually checks the seal, but thinking about it a bit more I
am not 100% sure, since this is a somewhat time-consuming process, and it
does not appear to me that Windows is any slower to draw the content of this
folder...)


Antoine




Re: OpenType vs TrueType (was current version of unicode-font)

2004-12-04 Thread Antoine Leca
Peter Constable écrivit:

>> On Behalf Of Christopher Fynn
>> If a Windows application needs to properly display Unicode text for
>> languages such as Hindi, Tamil, Bengali, Nepali, Sinhala, Arabic,
>> Urdu and so on then it probably needs to support OpenType GSUB and
>> GPOS lookups.
>
> Not just "probably".

Well, there are other rendering technologies than Uniscribe; and some of
them even succeed at displaying complex scripts...

For a contrived yet verifiable (open-source) example, let us have a look at
Eric Mader's LayoutEngine (in ICU) using Apple (GX) fonts with a Unicode
cmap. And yes, I am talking of something that can run on Windows.

Chris is correct, though, as Uniscribe is the undisputed leader in the year
2004.


Antoine




Re: latin equivalent to specific indian characters

2004-12-05 Thread Antoine Leca
I fail to see the connection between your question and Unicode.


On Saturday, December 4th, 2004 13:18Z, Rene Hache écrivit:

> To whom it may concern,

;-)


> I writing because I would to know if someone can help with certain
> Sanskrit/Pali characters in roman scripts.

Certainly there is a LOT of material about this around the net. Google is
certainly the best answer one can give you.
As a second-level hint, it is my belief that you will encounter more
material using Sanskrit as a keyword than with Pali. This should not mislead
you: as always with Google and co., more material overall means more wrong
leads to check.


> Most characters are simple, like vowels with macrons, or some letters
> that have either a dot below or above.

If you want to see things this way, you should try a coded character set
that fits this description. Fortunately, such a thing exists, and a good
choice could be IS 13194:1991, widely known as ISCII; in this coded
character set, dha is only one codepoint (namely C5). ISCII is a good choice
because you can easily print it using ad hoc software (CDAC is a good
keyword here), and also because you can somewhat easily map from or to
Unicode. Of course, collation and transliteration to Nagari or the other
scripts used in India are trivial; they were objectives of the design.
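
As an aside, the mapping for the consonants is indeed simple; a sketch under
my assumption that the Devanagari block keeps the ISCII letter order for
that range (a real converter needs tables for the vowel signs, nukta forms
and punctuation, which differ):

    #include <stdint.h>

    /* ISCII consonants map to Devanagari with a fixed offset,
       e.g. ISCII 0xC5 (dha) -> U+0927 DEVANAGARI LETTER DHA.
       Valid for the consonant range 0xB3..0xD7 only. */
    static uint16_t iscii_consonant_to_deva(unsigned char c)
    {
        return (uint16_t)(c + 0x0862);
    }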

On the other hand, if you want to handle the textual material in Unicode (if
not, I really cannot see why you are asking this here), you will have to use
a not-straightforward yet perfectly possible collating process. The fact
that dha is a single "letter" is not a real problem (this is a simple
contraction; any non-stupid algorithm should offer this); more interesting
things appear when you realise that while dha is one letter, dhi is two.
Even more interesting is that in traditional order, ã (nasalisation noted
with candrabindu) precedes a (without nasalisation). And real complexity
begins when you study the rules to collate the anusvara (written as a dot
above in Nagari script, and which can stand for itself or for a nasal of the
following consonant).


Antoine




Re: Nicest UTF

2004-12-06 Thread Antoine Leca
Asmus Freytag wrote:
> A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider

> 3) additional cost of accessing 16-bit registers (per character)

> For many processors, item 3 is not an issue.

I do not know; I only know a few of them. For example, I do not know how
Alpha or SPARC or PowerPC handle 16-bit data (I have heard differing
reports.) I agree this was not an issue for the 80386-80486 or the Pentium.
However, for the more recent processors, P6, Pentium 4, or AMD K7 or K8, I
am unsure; and I would appreciate insights.

I remember reading that in the case of the AMD K7, for instance, 16-bit
instructions (all? a few of them? only ALU-related, i.e. excluding load and
store, which is the point here? I do not know) are handled in a different
way from the 32-bit ones, e.g. by a reduced number of decoders. The impact
could be really important.

I also remember that when the P6 was launched (1995, as the PentiumPro),
there was a lot of criticism of Intel because the performance of 16-bit
code was actually worse than on an equivalent Pentium (though there was an
advantage for 32-bit code); of course this should be considered in context,
where 16-bit (DOS/Windows 3.x) code was important, something that has since
faded. But I believe the reasoning behind the arguments should still hold.

Finally, there is certainly an issue with the operand-size prefix needed on
x86 processors. The issue is reduced on the Pentium 4 (because the prefix
does not consume space in the L1 cache), but it still holds for the L2
cache. And the impact is noticeable: I do not have figures for access to
UTF-16 data, but I know that in 64-bit mode (with the AMD K8), the need for
a prefix to access 64-bit data, which consumes code cache space, was given
as the cause of a 1-3% penalty in execution time.

Of course, such a tiny penalty is easily hidden by other factors, such as
the ones Dr. Freytag mentioned.
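
For whoever wants to measure rather than guess, a crude C harness such as
the following compares 16-bit and 32-bit loads on a given machine. Beware:
the 16-bit array also touches half the memory, so cache effects favour it,
and any result is a single data point for one microarchitecture, not a
verdict:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define N (1u << 22)
    static uint16_t a16[N];
    static uint32_t a32[N];

    int main(void)
    {
        uint32_t s16 = 0, s32 = 0;
        for (uint32_t i = 0; i < N; i++) { a16[i] = (uint16_t)i; a32[i] = i; }

        clock_t t0 = clock();
        for (uint32_t i = 0; i < N; i++) s16 += a16[i];  /* 16-bit loads */
        clock_t t1 = clock();
        for (uint32_t i = 0; i < N; i++) s32 += a32[i];  /* 32-bit loads */
        clock_t t2 = clock();

        printf("sums %u %u\n", s16, s32);   /* keep the loops alive */
        printf("16-bit: %.3fs  32-bit: %.3fs\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }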


> Given this little model and some additional assumptions about your
> own project(s), you should be able to determine the 'nicest' UTF for
> your own performance-critical case.

My point was that the variability of these factors leads to keeping all
three UTFs as possible candidates when one considers writing a
"perfect-world" library. Can we say we are in agreement?

By the way, this also means that the optimisations to be considered inside
the library could be very different, since the optimal uses can be
significantly different. For example, use of UTF-32 might signal a user
bias toward easy management of codepoints, disregarding memory use, so the
library code should favour time over space (so unrolled loops and similar
techniques could be considered).
UTF-8 /might/ be the reverse.
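
As a small illustration of that bias (a sketch, assuming already-valid
input in both forms): with UTF-32, reaching the n-th codepoint is a plain
array index, while with UTF-8 even counting codepoints is a scan:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Count codepoints in valid UTF-8: count every non-continuation byte. */
    static size_t count_utf8(const unsigned char *s, size_t len)
    {
        size_t n = 0;
        for (size_t i = 0; i < len; i++)
            n += (s[i] & 0xC0) != 0x80;
        return n;
    }

    /* In UTF-32 the same operation is O(1) indexing. */
    static uint32_t nth_utf32(const uint32_t *s, size_t n) { return s[n]; }

    int main(void)
    {
        const unsigned char u8[] = "\xE0\xA4\x95\xE0\xA4\xA7"; /* ka dha */
        const uint32_t u32[] = { 0x0915, 0x0927 };
        printf("UTF-8 codepoints: %zu\n", count_utf8(u8, sizeof u8 - 1));
        printf("UTF-32 index 1:   U+%04X\n", (unsigned)nth_utf32(u32, 1));
        return 0;
    }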


Antoine




Re: OpenType not for Open Communication?

2004-12-09 Thread Antoine Leca

Peter C. wrote:
>> font vendors are creating fonts that use Unicode, platform vendors
>> (at least Mac and Windows -- Linux is too fractured a scene to
>> make a general statement)

On Monday, December 6th, 2004 18:40Z Edward H. Trager wrote:
>
> The really big, important applications and code libraries on Linux
> all use Unicode. Recent Linux distributions [...] Novell/SuSE ship
> with UTF-8 locales enabled by default right out of the box.

Also, smaller projects like Indlinux use UTF-8 as a base, even though other
character sets like ISCII might seem more logical. And as a counterpoint,
nowadays an appreciable part of the Indic software available on Windows
still uses proprietary encodings (but things are changing here).

Peter's real point is that Windows NT internally coerces any string to
Unicode (more exactly UTF-16, and sometimes UCS-2 or UTF-32), and I read
that current versions of MacOS do the same. Linux, respectful of its Unix
origins, does not do that: it is completely encoding-neutral (provided the /
is used as path separator); by the way, Linux does not handle fonts, so it
does not have to get involved in this debate. X11, which stands atop Linux
(or *BSD, or the Windows kernel), might be a bit more picky, but I
understand it still accepts about anything 8-bit-based, and does not
convert internally.
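
A minimal sketch of that design difference (the filename, DEVANAGARI LETTER
KA, is a made-up example): the NT-native call takes UTF-16 unconditionally,
while the Unix call takes an uninterpreted byte string:

    #ifdef _WIN32
    #include <windows.h>
    #else
    #include <fcntl.h>
    #include <unistd.h>
    #endif

    int main(void)
    {
    #ifdef _WIN32
        /* NT kernel: the path is UTF-16, always. */
        HANDLE h = CreateFileW(L"\u0915.txt", GENERIC_READ, 0, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
    #else
        /* Unix kernel: the path is just bytes; these three happen to be
         * KA in UTF-8, but the kernel neither knows nor cares. */
        int fd = open("\xE0\xA4\x95.txt", O_RDONLY);
        if (fd != -1) close(fd);
    #endif
        return 0;
    }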

This is actually a difference in design: open systems (as they were named)
were designed from day one to be independent of the operational charsets,
of which UTF-8 is just one, the most used nowadays. Linux and X11 inherited
this state of affairs (Plan 9, slightly more recent, did not, and it sticks
to UTF-8 internally.) On the other hand, Windows inherited (from IBM, I
understand, evolving from DOS) an attachment to a designated operational
charset; once upon a time this was a big problem (and it still is in a
number of cases), but with the advent of Windows NT, which allows Unicode
as the designated charset, things are getting better. Of course the
transition was harsher than the one to UTF-8 on open systems.

However, you can still publish an "ANSI" application in 2004 for Windows.
It is essentially the same as publishing an application for *nix which
_requires_ an iso-8859-x or EUC-XX locale: not a sensible thing to do, but
it may happen.



Antoine




Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Antoine Leca
On Monday, December 6th, 2004 20:52Z John Cowan wrote:

> Doug Ewell wrote:
>
>>> Now suppose you have a UNIX filesystem, containing filenames in a
>>> legacy encoding (possibly even more than one). If one wants to
>>> switch to UTF-8 filenames, what is one supposed to do? Convert all
>>> filenames to UTF-8?
>>
>> Well, yes.  Doesn't the file system dictate what encoding it uses for
>> file names?  How would it interpret file names with "unknown"
>> characters from a legacy encoding?  How would they be handled in a
>> directory search?
>
> Windows filesystems do know what encoding they use.

Err, not really. MS-DOS *needs to know* the encoding in use, a bit like a
*nix application that displays filenames needs to know the encoding to pick
the correct set of glyphs (though the constraints are much heavier.) Also,
Windows NT Unicode applications know it, because it cannot be changed :-).

But when it comes to other Windows applications (still the more common
kind) that happen to operate in 'Ansi' mode, they are subject to the
hazards of codepage translation. Even when Windows 'knows' the encoding
used for the filesystem (as when it uses NTFS or Joliet, or VFAT on NT
kernels; in the other cases it does not even know it, much as with *nix
kernels), the only usable set is the _intersection_ of the set used to
write and the set used to read; that is, usually, it is restricted to
US-ASCII, very much like the usable set in the *nix case...
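
The intersection effect can be reproduced directly with the Win32
conversion functions. In this Windows-only sketch, a filename written under
one 'Ansi' codepage is read back under another (1252 and 932 are picked
arbitrarily for the illustration); anything outside both codepages degrades
to the default character:

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        const char written[] = "caf\xE9.txt"; /* 0xE9 is e-acute in cp1252 */
        wchar_t wide[64];
        char read_back[64];

        /* Writer's view: cp1252 bytes to UTF-16. */
        MultiByteToWideChar(1252, 0, written, -1, wide, 64);
        /* Reader's view: UTF-16 back to cp932, which has no e-acute. */
        WideCharToMultiByte(932, 0, wide, -1, read_back, 64, NULL, NULL);
        printf("%s\n", read_back);             /* most likely "caf?.txt" */
        return 0;
    }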


Antoine




Malayalam Half-U: how

2002-11-08 Thread Antoine LECA
Hi folks,

A problem was reported on the Microsoft VOLT mailing list (a list that is
supposed to be dedicated to typography, but in practice deals more with
Indic scripts, because VOLT is the MS tool used to encode OpenType
information in a font, which in turn is required to display Indic scripts
on Windows.)

The problem deals with the Malayalam half-u. A user reported as an error
the fact that Uniscribe displays a dotted circle in the middle of a
Malayalam half-u. He wrote
   U+0D15 U+0D41 U+0D4D  (ka, u, virama)
and Uniscribe displayed (in reformed style) the ku syllable, then a
dotted circle, then a virama sign hanging alone.

Of course, the problem is that Uniscribe expects virama to come only
after consonants, so it displayed the sequence as an error. But I believe
the misunderstanding hides a real problem: how can the half-u be displayed?
Hence I am coming here to see what the gurus believe about this.

To help, I have done some research. Here is what I have found.

First, the phonetic reality: the root is when a word ends with halanta
(virama); while in other languages this "kills" the a-sound, in
Malayalam it rather replaces it with the half-u sound, particularly
when the consonant is a conjunct.
This is for example described in the ISO 15919 standard, available
with detailed explanations at Dr Anthony P. Stone's site,
<http://homepage.ntlworld.com/stone-catend/trind.htm>

According to Varamozhi (a site well informed about Malayalam),
<http://varamozhi.sourceforge.net/varamozhi-doc/varamozhi-6.html>,
when it comes to representation there exist differing writing
"styles" for this single phonetic reality; in North Kerala, the
usage is to write the halanta sign in place, and done!
Obviously, this is very much in line with the other scripts.

However, in South Kerala, as Mr. Cibu said, the usage is to write the
halanta sign as well as to show the matra for the u vowel.
While it is said that this latter usage occurs with the reformed style,
I have seen examples with the traditional style as well (although from
a book printed in Madras, so it might be wrong.)
Obviously, the user of Uniscribe intended to display this combination,
which to him is the normal way to display a word when it ends with
halanta!

Knowing that, we can now notice that Unicode has a note under Malayalam
virama (U+0D4D) saying it is the same as the Malayalam half-u. To me, this
means that under Unicode the half-u is supposed *not* to be specifically
encoded, and only the usage from North Kerala is supposed to be followed.

Other relevant information: ISCII-91 seems mute on the subject, and the
CDAC products (like iLeap) seem unable to render the half-u in Malayalam
(unless one "cheats" using the INV pseudo-consonant.)

It is too late to discuss the pros and cons of Unicode's choice, back in
1992 (probably, they chose to push the unification of encodings as far as
possible, in order to ease sorting and similar tasks.)
Now, the problem is: if someone wants to specifically encode the showing
of the u matra, in a context (like Uniscribe) where both the North and the
South Kerala usages could be intended, how should it be done? It seems
rather natural to use the combination
 U+0D41  U+0D4D,
following the precedent established in Unicode 3.1 (IIRC) for the modern
Bengali A and E initial vowels (in words borrowed from English), which are
written as Bengali A or E, followed by virama then ya (hence an exception
to the rule that virama may only follow a consonant.)

Are the gurus here OK with this "solution"?

Can it be "sanctified", for example with the inclusion of adequate wording
in some revision of Unicode?


If this is agreed, then when dealing with aspects other than rendering,
people should take this into account, and effectively ignore the U+0D41
when followed by U+0D4D whenever the task is searching, sorting, etc.
While this is a nuisance, it does not appear completely prohibitive to
me. But I admit I have not thought a lot about the consequences of
allowing such "presentation encoding."
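
For what it is worth, the fold is cheap. A minimal C sketch over UTF-32
text, assuming the convention proposed above, could look like:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Drop U+0D41 (u sign) when immediately followed by U+0D4D (virama),
     * folding the South-Kerala spelling onto the North-Kerala one for
     * searching/sorting.  In place; returns the new length. */
    static size_t fold_half_u(uint32_t *s, size_t len)
    {
        size_t out = 0;
        for (size_t i = 0; i < len; i++) {
            if (s[i] == 0x0D41 && i + 1 < len && s[i + 1] == 0x0D4D)
                continue;
            s[out++] = s[i];
        }
        return out;
    }

    int main(void)
    {
        uint32_t word[] = { 0x0D15, 0x0D41, 0x0D4D }; /* ka, u, virama */
        size_t n = fold_half_u(word, 3);
        for (size_t i = 0; i < n; i++)
            printf("U+%04X ", (unsigned)word[i]);     /* U+0D15 U+0D4D */
        printf("\n");
        return 0;
    }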


Regards,
Antoine





Re: Proposal to add Bengali Khanda Ta

2002-12-03 Thread Antoine LECA
Hi folks,

This post is a bit long, so here is a summary:
- regarding the encodings of TMA, there are currently several possibilities,
so it should be possible to sort out all "normal" cases with current
characters;
- however, this shows that ISCII provides a character, INV, with no
counterpart in Unicode. Perhaps this is the problem to be solved.


Andy White wrote on 2002-11-29 13:21:14Z:

> Marco wrote
>
>> - Does ISCII have a way to distinguish the two cases above
>> and the other possible combinations? I mean:
>> 1. Ta_Ma_Ligature,
>> 2. Khanda_Ta + Ma,
>> 3. Half_Ta + Ma,
>> 4. Ta + Virama + Ma.
>
> 1. Ta_Ma_Ligature is simply 'ta virama ma'
> 2. Khanda_Ta + Ma, is 'ta virama virama ma' (equivalent to 'ta virama
>    zwnj ma')
> 3. Half_Ta + Ma is 'ta virama inv ma' (equivalent to 'ta virama zwj ma')


I fail to understand why it cannot (also) be coded as 'ta halant nukta ma',
using the "soft halant" feature of ISCII, which is supposed to do just that
(see IS 13194:1991, 6.3.2).
I know iLeap (and ISFA in general) renders it incorrectly, but when I read
6.3.2 ("prevents it from combining with the following consonant"), I
believe that the iLeap software is in error here.


> 4. Ta + Virama + Ma should be 'ta virama virama inv ma' but this is not
> implemented in the iLeap application I am using!


I got an acceptable result with 'ta inv halant ma'. Of course this is a
complete hack (for example, a romanisation of the result will expose the
incorrectness), but for visual purposes only, it does the job. And since
Ta + visible halant is not supposed to be anything useful for normal
writing (i.e. only useful for school teaching or similar tasks, as I
understand things; at least no Bengali words are supposed to be written
this way), it seems acceptable to me.


The problem I have, and it is very well summarised by Andy and Marco here,
is that in ISCII-91 I see *three* mechanisms to vary the rendering:
"Explicit halant", coded E8 E8, described in 6.3.1
"Soft halant", coded E8 E9, described in 6.3.2
"Invisible consonant INV", coded D9, described in 6.4, which further
   may combine with the other two, but is intended only for rendering
   purposes

At the same time, Unicode (3.0) provides only *two* mechanisms:
inserting ZWNJ after virama, called "Explicit Virama"
inserting ZWJ after virama, called "Explicit Half-Consonant"

There is little doubt that "Explicit Virama" and "Explicit Halant" can be
paired: their descriptions are very similar.
However, I remember reading in Unicode 1.0 (unfortunately, I do not have it
at hand) that the position at DA (the INV consonant, according to ISCII-88)
was equated to ZWJ. While that might appear correct for some cases,
I do not believe it is. The Indic FAQ also has words on the topic, but
there are many things to comment on in this FAQ, so I won't elaborate
further (however, if the editor is reading, please contact me.)
I believe ZWJ could be equated to the Soft Halant, as the descriptions are
similar (except for the well-known exception of the eyelash-ra, as stated
in Unicode 2.0), despite the important difference in wording.
I understand that the Malayalam cillus are now to be encoded with ZWJ, too.
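
To keep the bookkeeping straight, here is the pairing argued above as a
small C table (the ISCII bytes are those of IS 13194:1991, 6.3.1/6.3.2/6.4;
the Unicode column is the equivalence this message suggests, not a
normative mapping):

    #include <stdio.h>

    static const struct {
        const char *iscii_name;
        const char *iscii_bytes;
        const char *unicode_pairing;
    } mechanisms[] = {
        { "explicit halant",         "E8 E8", "virama + ZWNJ (Explicit Virama)" },
        { "soft halant",             "E8 E9", "virama + ZWJ (Half-Consonant)" },
        { "invisible consonant INV", "D9",    "none found (see below)" },
    };

    int main(void)
    {
        for (int i = 0; i < 3; i++)
            printf("%-25s %-6s -> %s\n", mechanisms[i].iscii_name,
                   mechanisms[i].iscii_bytes, mechanisms[i].unicode_pairing);
        return 0;
    }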

As a result, we are left with one code in ISCII-91, INV (D9), which is
indeed quite special (its description makes clear it is not used to write
some sound; it is merely an artefact, useful for specialised tasks), and
which ends up with no counterpart in Unicode, at least none that I can
spot at once (remember, it should be a character that shares the
properties of the "regular" consonants, i.e. ligating before or after
virama, or before vowel signs.) Perhaps, as the discussion above showed,
it is really this character that appears to be missing to perform
specialised tasks with Indic scripts (such as the Malayalam half-u that I
was speaking about last month)?

Andy's new proposal, CBM, is a bit different, since it attaches precise
rules to solve some cases. The thing that makes me a bit reluctant is that
there is no prior art with CBM, so we could be wrong a couple of times,
with subsequent rectifications, errata and changes of meaning: overall,
bad things. On the other hand, including a new character with the same
semantics as one already present in ISCII would ease some conversions (I
know they would be few), and also provide a reference to implement against.
Having said that, Andy's first example, with the relative priorities of
reph versus jophola (and the similar cases between reph and
rakar-vattu/vakar/yakar/lakar), remains to be examined in more detail.


Regards,
Antoine




Script of U+0951 .. U+0954

2002-12-04 Thread Antoine LECA
Hi folks,

I recently noticed (I was off line for a while) the inclusion of the
Scripts.txt file in the Unicode Character Database. I find it very
interesting, and I noted it is informative. However, there is a detail
that makes me quite unhappy: characters U+0951 .. U+0954 (the various
accents described in ISCII-88 for marking tones in some Vedic texts) are
assigned to the DEVANAGARI script.

These accents are the only ones usable at the moment for Vedic/Sanskrit
tone marking, and they include probably the most interesting ones,
U+0951 and U+0952. However, these two accents are not specific to
Devanagari, and could be used without problem with the other scripts
that may be used to write Sanskrit/Vedic. So I believe they should be
moved to another place, for example the "INHERITED" pseudo-script.

Concerning U+0953 and U+0954 (grave and acute accents), the point is
that they are mostly used with... Latin characters (grave is svarita,
and acute is udatta.) In fact, I believe the same "problem" arises with
them as with U+0340 and U+0341, with the exception that one may
contemplate placing U+0953/U+0954 above the middle of the diphthongs
ai/au (I do not believe this is standard usage, though.)
So again, I believe they can be moved to the "INHERITED" pseudo-script.
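
For anyone who wants to check how a given UCD snapshot classifies these
marks, ICU exposes the script property. A minimal sketch, assuming ICU4C
is installed (its data tracks whatever Unicode version it ships with, not
necessarily the Scripts.txt discussed here):

    #include <stdio.h>
    #include <unicode/uscript.h>

    int main(void)
    {
        UChar32 marks[] = { 0x0951, 0x0952, 0x0953, 0x0954 };
        for (int i = 0; i < 4; i++) {
            UErrorCode err = U_ZERO_ERROR;
            UScriptCode sc = uscript_getScript(marks[i], &err);
            printf("U+%04X -> %s\n", (unsigned)marks[i],
                   U_SUCCESS(err) ? uscript_getShortName(sc) : "error");
        }
        return 0;
    }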


This issue is of relevance to rendering engines. For example, Microsoft's
Uniscribe refuses to draw U+095x on top of any syllable which is not
Devanagari. I believe this behaviour is incorrect, but the Scripts.txt
file seems to back the MS position. If this is right (and so if I am
wrong), then that means we need a number of new characters, one set for
each script used to write Sanskrit/Vedic (including Latin).

Another position could be to say that the combining characters from the
U+03xx range should be used, equating (in addition to the above)
U+0951 (DEVANAGARI STRESS SIGN UDATTA, in fact a svarita accent) with
U+030D (COMBINING VERTICAL LINE ABOVE), and
U+0952 (DEVANAGARI STRESS SIGN ANUDATTA) with U+0331 (COMBINING MACRON
BELOW).
Note that this implies changes to the rendering engines as well.

I welcome comments and criticisms, because I am very far from being
omniscient on the subject.


Regards,
Antoine





Re: Script of U+0951 .. U+0954

2002-12-05 Thread Antoine LECA
Peter Constable wrote:


> There is a potential concern in Uniscribe/OpenType: substitution and
> positioning rules in OT are organised hierarchically by script then by
> individual writing system / typographic groups (the label used is
> languages, but the intent is really groups of writing systems that share
> common typographic behaviours). Thus, a rule that handles positioning of a
> glyph for 0950 (or whatever) relative to some member of some class of
> glyphs must be entered somewhere under some particular script. Now, there
> is nothing that prohibits a font developer from creating multiple
> positioning rules for 0950 with different classes of base glyphs and to
> have a different one placed in the hierarchy under several different
> scripts.


Fully agreed so far.


> But there may yet be an issue on the Uniscribe side: given a
> string of characters, which it will begin by mapping into a string of
> initial glyphs, it has to decide which script tag(s) to apply to portions
> of the string. What I don't know is whether it generally assumes combining
> marks belong to a specific script, or whether it allows combining marks to
> inherit their script from the base characters with which they combine.


Look: in current Uniscribe, leading ZWJ and ZWNJ are discarded (i.e., with
input U+200D U+093E, you still get the circle meaning "incorrect combining",
even though this is perfectly correct Unicode as far as I understand.)
So clearly, they have a problem with "backtracking" when the script is
not determined by the first character in the stream. I can understand that.
OTOH, when ZWJ or ZWNJ come second or later in conjuncts, they are properly
handled, in every script where it is relevant. What I would like to see is
the Indic accents handled in the same way. And when I spoke about that
with MS people (and not only me, but also Pothana's designer), MS answered
that the Unicode standard seemed to imply that these accents apply to the
Devanagari script only.
It looks to me like this Scripts.txt just confirms the MS point of view.
If this is as intended, that is fine, but it means that a bunch of new
characters (with little or no added value) are to be added to some new
revision of Unicode.

By the way, the situation is similar with the dandas (U+0964 and U+0965):
they only appear in the Devanagari and Myanmar blocks, but are used for many
other (all?) South-Asian scripts as well. Worse, they are often used, so
there is already a lot of material encoded with these codepoints.
Luckily, dandas do not need special handling from complex-script engines,
so it does not matter much whether Uniscribe decides they are Devanagari or
script-less (except perhaps for the selection of the font).


Antoine




