RE: Summary: xml:lang validity and RFC 1766 refs to outdated

2000-08-08 Thread Doug Ewell

Mike Brown <[EMAIL PROTECTED]> wrote:

> Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO
> 639 and ISO 3166, at least not by a strict interpretation of its
> formal language.  And to date, there still is no successor to RFC
> 1766.

The successors to ISO 639 and ISO 3166 are newer versions of 639 and
3166.  ISO standards don't get new numbers when they are revised, as
RFCs do.

I don't see anything in RFC 1766 that hardcodes it to the 1988 versions
of either 639 or 3166.  The 1988 versions are cited in the "References"
section, but that is just to provide a bibliographically complete
citation based on the versions available at the time (March 1995).

It would make no sense in any event for an application using RFC 1766
(including, but not limited to, XML) to be artificially limited to the
language or country codes set at a fixed point in the past.

-Doug Ewell
 Fullerton, California

Why not to move characters (was: is there any way to change already defined character codes?)

2000-08-08 Thread 11digitboy

You don't want to move characters because then you
could change the meaning of a sentence that way.
I don't want to price something at 1000 cows when
I mean 1000 yen. Or worse, 100 yen.

Get your own FREE Bolt Onebox - FREE voicemail, email, and
fax, all in one place - sign up at

Re: is there any way to change already defined character codes?

2000-08-08 Thread Michael \(michka\) Kaplan

> > E.g., if you look at the Latin part, you see that
> > the 26 letters used in
> > modern English are all contiguously ordered in
> > two areas: U0041 to U005A
> > (uppercase) and U0061 to U007A (lowercase).
> Yeah, but so what? All you gotta do is turn the 6th
> bit off and there you go!
> > 
> > But that's the end of the story! All the other
> > 100's Latin letters are
> > scattered all over, using no consistent order.
> > 
> Too bad unicode values can't be fractions!!

Let's take this one offline, Robert.



2000-08-08 Thread Michael \(michka\) Kaplan

Well, if the language does not have them, you will not find them. Funny how
that works, huh?


Michael Kaplan
Trigeminal Software, Inc.

- Original Message -
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Tuesday, August 08, 2000 4:24 PM

> Where are the Georgian digits? I want a set of Georgian
> digits so I can use them as counter digits.
> --
> Robert Lozyniak
> Accusplit pedometer manufactures can go suck eggs
> My page:
> (917) 421-3909 x1133 - voicemail/fax

RE: is there any way to change already defined character codes?

2000-08-08 Thread 11digitboy

Robert Lozyniak
Accusplit pedometer manufactures can go suck eggs
My page:
(917) 421-3909 x1133 - voicemail/fax

> Sandro Karumidze wrote:
> > The issue is that in Unicode there is a sequence of Georgian
> > characters different from what these people think it should be.
> > [...] In the beginning of this century 5 characters were dropped
> > [...]
> > In Unicode these 5 characters follow the 33. There is a different
> > point of view that those 5 should be included among the others.
>
> (You definitely need an official reply, but let's go on with some more
> informal chatting.)
>
> I foresee that this would not be considered a good reason to change
> anything.
>
> The order of characters in Unicode (or in any other character encoding) is
> not important. The scope of a character set is to assign a unique number to
> each character, not to define an "alphabetical order".

Yeah. Just look at the kanji digits!

> If you notice, the situation that you describe is true for *all* the
> alphabets in Unicode.
>
> E.g., if you look at the Latin part, you see that the 26 letters used in
> modern English are all contiguously ordered in two areas: U0041 to U005A
> (uppercase) and U0061 to U007A (lowercase).

Yeah, but so what? All you gotta do is turn the 6th
bit off and there you go!
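
The bit trick mentioned here can be checked with a short sketch. Python is used purely for illustration, and the helper name `clear_case_bit` is made up for this example. Clearing bit 0x20 (the sixth bit, counting from one) maps a-z onto A-Z, but it does not carry over to Latin letters outside that basic range, which is the point of the quoted remark about the other letters being scattered.

```python
import unicodedata

CASE_BIT = 0x20  # the "6th bit" (counting from 1) that separates a-z from A-Z


def clear_case_bit(ch: str) -> str:
    """Return the character whose code point is ch's with bit 0x20 cleared."""
    return chr(ord(ch) & ~CASE_BIT)


# Works for the 26 basic Latin letters, which sit in two parallel runs.
print(clear_case_bit("a"), clear_case_bit("z"))

# Outside that range the trick breaks down: U+00FF (small y with diaeresis)
# has its capital at U+0178, so clearing the bit lands on a different letter.
wrong = clear_case_bit("\u00ff")
right = "\u00ff".upper()
print(f"U+{ord(wrong):04X} vs U+{ord(right):04X}")
print(unicodedata.name(wrong), "/", unicodedata.name(right))
```

Real case conversion therefore needs the per-character case mappings from the Unicode character database (which is what `str.upper()` consults), not arithmetic on code points.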

> But that's the end of the story! All the other 100's Latin letters are
> scattered all over, using no consistent order.

Too bad unicode values can't be fractions!!

> The same is true for Cyrillic, Greek, Hebrew, Arabic, and so on. Have a look
> at those blocks: the basic letters for post-czar Russian, modern Greek,
> Israeli Hebrew, modern Arabic etc. are consistently ordered, but the letters
> for other languages that use the same alphabets (or ancient letters for the
> same languages) are scattered all over with no specific order.
>
> The reason why no one cares about the order of characters is that it is
> *impossible* to determine a "correct" order.
>
> In an alphabet used by more than one language (e.g. Latin, Cyrillic, Arabic,
> Devanagari, etc.), the alphabetic order is normally different for each
> language.
>
> Moreover, many languages have more than one alphabetic order, all equally
> valid and in current usage.
>
> For this reason the problem of "alphabetic order" has been pulled apart from
> character sets, and addressed separately.
>
> In Unicode, the issue of "collation" is handled by an ad-hoc optional
> algorithm, that is part of the standard but is separated from the encoding
> issue itself.
>
> The algorithm is titled "Unicode Technical Report #10: Unicode Collation
> Algorithm", and you can find it here:
> .
>
> *That* is the place to check whether Georgian letters are in the correct
> order or not. And if they are not, you have two options:
>
> 1) Ask Unicode to change it: here you *do* have some chance of being
>    listened to, if you have valid arguments.
> 2) Change it yourself: unlike the character values, the collation algorithm
>    is designed to be flexible and customizable.
>
> Regards,
> _ Marco



2000-08-08 Thread 11digitboy

Where are the Georgian digits? I want a set of Georgian
digits so I can use them as counter digits.

Robert Lozyniak
Accusplit pedometer manufactures can go suck eggs
My page:
(917) 421-3909 x1133 - voicemail/fax


Re: is there any way to change already defined character codes?

2000-08-08 Thread Rick McGowan

The question is:

> Is there any way to change already defined character codes?

And the definitive answer is "No".

Marco Cimarosti wrote:

> (You definitely need an official reply, but let's go on with some more
> informal chatting.)

OK, here is another semi-official reply from me, as a UTC member, since everyone else 
seems to be at the UTC meeting this week...

As far as I know, neither WG2 nor UTC would vote to re-order the Georgian alphabet 
because that would invalidate all existing data.  Neither WG2 nor UTC would remove or 
move any existing characters for the same reason.  Use a tailored sorting table if you 
need a different ordering.
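
As a rough illustration of what a "tailored sorting table" can look like, the sketch below builds one for the traditional Georgian order that Michael Kaplan describes elsewhere in this digest (He between Zen and Tan, Hie between Nar and On, We between Un and Phar, Har between Xan and Jhan). It is written in Python for brevity and is not the Unicode Collation Algorithm; the names `INSERT_AFTER`, `TRADITIONAL` and `traditional_key` are invented for this example, and only the Mkhedruli letters U+10D0..U+10F4 are covered. The code points themselves stay where they are; only the comparison changes.

```python
# Modern Georgian (Mkhedruli) letters An..Hae occupy U+10D0..U+10F0 in order.
MODERN = [chr(cp) for cp in range(0x10D0, 0x10F1)]

# Archaic letter to place immediately after a given modern letter.
INSERT_AFTER = {
    "\u10d6": "\u10f1",  # He  goes after Zen (before Tan)
    "\u10dc": "\u10f2",  # Hie goes after Nar (before On)
    "\u10e3": "\u10f3",  # We  goes after Un  (before Phar)
    "\u10ee": "\u10f4",  # Har goes after Xan (before Jhan)
}

# Build the tailored alphabet by walking the modern one and splicing in
# each archaic letter right after its anchor.
TRADITIONAL = []
for letter in MODERN:
    TRADITIONAL.append(letter)
    if letter in INSERT_AFTER:
        TRADITIONAL.append(INSERT_AFTER[letter])

RANK = {ch: i for i, ch in enumerate(TRADITIONAL)}


def traditional_key(word: str) -> list:
    """Sort key: rank of each letter in the tailored order.

    Characters outside the table sort after all tabled letters,
    in code point order.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


words = ["\u10d7", "\u10f1", "\u10d6"]               # Tan, He, Zen
by_code_point = sorted(words)                        # Zen, Tan, He
by_tradition = sorted(words, key=traditional_key)    # Zen, He, Tan
print([f"U+{ord(w):04X}" for w in by_code_point])
print([f"U+{ord(w):04X}" for w in by_tradition])
```

Existing data is untouched by this approach, which is why it is the suggested route instead of moving characters.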

Jianping Yang wrote:

> Not really for Unicode in which we have relocated some codepoints
> for Hangul between Unicode 1.1 and 2.0 :)

The fact that there was a re-ordering in Hangul some years ago was a travesty and an 
embarrassment that nobody wants to repeat.



RE: Unicode String literals on various

2000-08-08 Thread Marco . Cimarosti

Hi, Antoine.

> I can continue to dissert on this subject (all of this should
> eventually be cooked into a FAQ anyway), but I do not want to flood
> the list with a marginally interesting subject.

Merci beaucoup. It was very informative!


P.S. You should not be so shy: up to date information
about how Unicode may be used in the world's most
important programming language does not sound so
"off topic" or "marginally interesting" to me.


Re: Summary: xml:lang validity and RFC 1766 refs to outdated codes

2000-08-08 Thread John Cowan

Mike Brown wrote:

> Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO 639 and
> ISO 3166, at least not by a strict interpretation of its formal language.
> And to date, there still is no successor to RFC 1766.

Right.  So

Yn nediwn seint yn llinghedig,
yn nediwn seint yn cor

is not proper XML, although it is well-formed, because the language tag
"roa" (Romance, Other) is not legal by RFC 1766.  But when RFC 1766 is
officially revised to include such language tags, it *will* be good XML.

> The use of "are" in that statement sounds as definitive as "must" to me.

No, because a violation of a "must" rule is a violation of well-formedness,
requiring the report of a fatal error and draconian error recovery.
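
The distinction can be seen with an ordinary non-validating parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the element name `motto` and the sample document are invented for illustration and are not taken from the message above. The parser accepts the document and simply hands back the `xml:lang` value, so any check against RFC 1766 has to happen in the application.

```python
import xml.etree.ElementTree as ET

# The reserved "xml" prefix is bound to this namespace by the XML Namespaces spec.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical document; "motto" is an invented element name.
doc = '<motto xml:lang="roa">Yn nediwn seint yn llinghedig</motto>'

root = ET.fromstring(doc)             # parses without error: the text is well-formed
lang = root.get(f"{{{XML_NS}}}lang")  # the parser hands the value over unchecked
print(lang)
```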

> As
> an XML document author, or the programmer of an XML document authoring tool,
> tell me, do I or do I not use RFC 1766 language tags/identifiers as xml:lang
> values?

You do.

> It seems that XML says I must use them, but it would not be a violation
> of validity if I didn't use them.

It is a violation of the intent of the xml:lang attribute not to use them.

> the removal of productions 33-38 from XML really just seems to be
> intended to allow RFC 1766 and its successors to determine the proper
> construction of a language tag, which makes more sense than trying to
> reiterate the RFC's technical contents in XML's specification.

Just so.

> It doesn't
> necessarily follow that xml:lang values can avoid conforming to RFC 1766.

They cannot avoid it.

Schlingt dreifach einen Kreis um dies! || John Cowan <[EMAIL PROTECTED]>
Schliesst euer Aug vor heiliger Schau,  ||
Denn er genoss vom Honig-Tau,   ||
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)

Re: Unicode String literals on various

2000-08-08 Thread Antoine Leca

> Antoine Leca wrote:
> >   char C_thai[] =
> > "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
> Would the Unicode values be converted to the local SBCS/MBCS character set?

In this case, yes (assuming a normal C compiler).

With wchar_t / L"...", they are converted to the local "wide character set",
which happens to be Unicode on most boxes, with the following main exceptions:

- some (cheap) C compilers do not have any special support for wchar_t,
 so it defaults to the same type as char, which is usually 8 bits wide;

- with East Asian C compilers, wchar_t holds either Unicode or
 a flat character coding, in which every character, whether coded as SBCS or
 DBCS, is stored with its nominal, legacy code in a 16-bit or 32-bit cell
 (this differs from MBCS in that the ASCII characters are stored
 in cells the same width as DBCS characters);

- EBCDIC implementations have their own rules (for obvious reasons), that
 I do not know exactly (I am not sure they are consistent)

C99 also specifies that if __STDC_ISO_10646__ is defined, then the wchar_t
values are the Unicode codepoints (to learn whether that means UTF-16 or
UTF-32, look at WCHAR_MAX to see whether wchar_t is 16-bit or 32-bit).

> If yes:
> Is the definition of this locale info part of the C99 standard itself, or is
> it operating system's locale?

It is "implementation-defined". Which means:
- it is not required in any way by the C99 Standard itself (except if
 __STDC_ISO_10646__ is defined);
- it is required to be stated in full words in the documentation for the compiler;
- it can vary as per compilation options; often the OS's current locale is
 the default value, which can be overridden.

> And what happens to Unicode values that cannot be converted in that
> character set?

The compiler is required to fall back to something (it cannot refuse to
compile, nor can it simply drop the character); it is allowed to "fall back"
to a different character depending on the character typed, though. So, for example,

  #include <stdio.h>
  int main(void) {  printf("%ls\n", L"\u00C0 table!");  return 0;  }

can produce (among others; this is UTF-8 encoded):

À table!
A table!
à table!
 table!

I can continue to dissert on this subject (all of this should eventually be
cooked into a FAQ anyway), but I do not want to flood the list with a marginally
interesting subject.


Re: is there any way to change already defined character codes?

2000-08-08 Thread John H. Jenkins

At 11:01 PM -0800 8/7/00, Jianping Yang wrote:
>Not really for Unicode in which we have relocated some codepoints for Hangul
>between Unicode 1.1 and 2.0 :)

And have regretted it ever since. Moving the Hangul and renaming æ
have caused no end of problems.  It was the fact that it was so
disastrous when done once that makes everyone determined not to do it
again.
John H. Jenkins

FW:Unicode Font with Special Effects

2000-08-08 Thread Magda Danish (Unicode)

-Original Message-
From: Greg Olsen [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 04, 2000 3:14 AM
Subject: Font question

Dear Sirs,
My name is Greg Olsen.  I am an Industrial Designer in Irvine California
and I need information.  I am developing a user Interface that I will
hand off to be programmed in C.
The interface design uses the Arial font; the catch is that there is a
beveled effect on each letter.  I was wondering if there is a Unicode
font that is capable of this type of effect.
The interface is for a medical product that is to be released in
numerous countries and languages.  Any information would be helpful.
Thank you for your time and response,
Greg Olsen

Patton Design
8 Pasteur #170
Irvine, CA  92618

RE: Unicode String literals on various

2000-08-08 Thread Marco . Cimarosti

Antoine Leca wrote:
>   char C_thai[] = 
> "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

Would the Unicode values be converted to the local SBCS/MBCS character set?

If yes:

Is the definition of this locale info part of the C99 standard itself, or is
it operating system's locale?

And what happens to Unicode values that cannot be converted in that
character set?

_ Marco

Re: Zero-width ligator

2000-08-08 Thread Doug Ewell

Peter Constable <[EMAIL PROTECTED]> wrote:

> I inquired about that recently on the unicoRe list, and was told that
> the semantics of ZWJ/ZWNJ will be extended in 3.0.1 (or maybe it was
> 3.1).

Well, that's a good thing.  It sounds like the benefits described by
Everson will be made available in Unicode after all.

> You mentioned that this decision was made at the meeting in February.
> Interestingly, I was at that meeting, and my recollection was that
> extending the semantics of ZWJ/ZWNJ was going to be given further
> consideration, after some people investigated the implications of
> extending the semantics of ZWJ, particularly for Indic scripts. But I
> left before the meeting was over, and the minutes reflect that a
> decision was in fact made (although the weasel word "provisionally"
> is used).

Thanks for the insight on this process.  Somehow I needed more
information than the word "rejected" in the Pipeline table could
offer.  \u263a

-Doug Ewell
 Fullerton, California

RE: Summary: xml:lang validity and RFC 1766 refs to outdated code

2000-08-08 Thread Mike Brown

> > XML 1.0 says that xml:lang attributes must match production 33
> In fact, not so. Productions 33-38 have no normative value
> whatsoever, as there is neither a production nor normative
> language connecting them with the rest of XML 1.0.
> [...]
> In recognition of this fact, official erratum E73 removes these
> productions from XML 1.0 altogether. It also allows for a
> successor to RFC 1766 when and if such a thing exists.

Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO 639 and
ISO 3166, at least not by a strict interpretation of its formal language.
And to date, there still is no successor to RFC 1766.

E73 says in its rationale "The XML processor does not deal with the value of
xml:lang", but it also says, more formally, "The values of the attribute are
language identifiers as defined by [IETF RFC 1766]".

The use of "are" in that statement sounds as definitive as "must" to me. As
an XML document author, or the programmer of an XML document authoring tool,
tell me, do I or do I not use RFC 1766 language tags/identifiers as xml:lang
values? It seems that XML says I must use them, but it would not be a violation
of validity if I didn't use them.

I also don't see how one could read RFC 1766 in such a way as to ignore its
prescription of a finite range of possible values for what it calls a
language tag:

  Language-Tag = Primary-tag *( "-" Subtag )
  Primary-tag = 1*8ALPHA
  Subtag = 1*8ALPHA

In the primary language tag:

 -All 2-letter tags are interpreted according to ISO
  standard 639, "Code for the representation of names
  of languages" [ISO 639].

 [...mention of "i-" and "x-"...]

 -Other values cannot be assigned except by updating
  this standard.

The removal of productions 33-38 from XML really just seems to be
intended to allow RFC 1766 and its successors to determine the proper
construction of a language tag, which makes more sense than trying to
reiterate the RFC's technical contents in XML's specification. It doesn't
necessarily follow that xml:lang values can avoid conforming to RFC 1766.
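
The grammar and primary-tag rules quoted above are small enough to sketch as a checker. This is a minimal illustration in Python, not an official validator: the function name `classify` and its return labels are invented here, the "i" and "x" cases follow RFC 1766's reservation of those values for IANA registrations and private use, and no lookup against the actual ISO 639 code list is attempted.

```python
import re

# Language-Tag = Primary-tag *( "-" Subtag ), each part 1*8ALPHA
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def classify(tag: str):
    """Classify a tag by its primary tag, following the RFC 1766 text quoted above.

    Returns None if the tag does not match the grammar at all.
    """
    if not LANGUAGE_TAG.fullmatch(tag):
        return None
    primary = tag.split("-")[0].lower()
    if len(primary) == 2:
        return "iso639"      # to be looked up in ISO 639 (lookup not done here)
    if primary == "i":
        return "iana"        # IANA-registered tag
    if primary == "x":
        return "private"     # private use
    return "unassigned"      # matches the grammar, but no meaning is assigned


for tag in ["en-US", "i-navajo", "x-klingon", "roa", "en_US"]:
    print(tag, "->", classify(tag))
```

Under this reading, a three-letter primary tag such as "roa" passes the grammar but falls into the "other values" case, which is the gap the thread is arguing about.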

[We're on the same side, here. I'm just playing devil's advocate, because
after I heard about this issue and reviewed the specs myself, I found that
there were indeed points of contention.]


RE: Summary: xml:lang validity and RFC 1766 refs to outdated code

2000-08-08 Thread Mike Brown

Jonathan Borden wrote:
> > the 2-letter language code portion of xml:lang values must
> > not only be 2 ASCII characters, but...
> Actually production [34] states that the LangCode is one of:
>   ISO639Code | IanaCode | UserCode

I know that. I also knew that productions 33 through 38 had been made
obsolete by an erratum and that the only normative reference for xml:lang
values was RFC 1766. That's why I said the 2-letter language code *portion*
of xml:lang values. This is in reference to those RFC 1766 conforming
language identifiers that include 2-letter language codes. The text of RFC
1766 describes exactly when and where those codes are to be used.

RE: is there any way to change already defined character codes?

2000-08-08 Thread Peter_Constable

On 08/08/2000 06:40:17 AM Marco.Cimarosti wrote:

>(You definitely need an official reply, but let's go on with some more
>informal chatting.)

All the "officials" are busy meeting this week, but the statement, "Can't
be done" is just as true whether it comes from the lips (or... fingertips)
of a Ken Whistler or Mark Davis as from a Marco Cimarosti or a Chris Fynn.
There are enough of us on this list that have a solid understanding of the
standard and its development that a question like this can be answered
without waiting for an "official" answer (though this question really ought
to be answered somewhere on the Unicode web site); if somebody were to give
wrong information, there would be several that wouldn't hesitate to

- Peter

Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485

Re: Zero-width ligator

2000-08-08 Thread Peter_Constable

I inquired about that recently on the unicoRe list, and was told that the
semantics of ZWJ/ZWNJ will be extended in 3.0.1 (or maybe it was 3.1). You
mentioned that this decision was made at the meeting in February.
Interestingly, I was at that meeting, and my recollection was that
extending the semantics of ZWJ/ZWNJ was going to be given further
consideration, after some people investigated the implications of extending
the semantics of ZWJ, particularly for Indic scripts. But I left before the
meeting was over, and the minutes reflect that a decision was in fact made
(although the weasel word "provisionally" is used).

- Peter

Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485

Re: thanks is there any way to change already defined character codes?

2000-08-08 Thread Michael \(michka\) Kaplan

Does the Traditional sort order I mentioned meet the needs of typical usage?
Or are there sorting rules that are missing?

I am slowly learning Georgian and have a localizer who I work with as well,
but I have much to learn and he makes many allowances for my ignorance (so
he may not be as quick to correct me when I am missing something!).


- Original Message -
From: "Sandro Karumidze" <[EMAIL PROTECTED]>
Sent: Tuesday, August 08, 2000 7:10 AM
Subject: thanks is there any way to change already defined character codes?

> thank you for information.
> I completely agree with you that codes should not be changed. I just wanted to
> know more about rules.
> best regards,
> Sandro Karumidze
> "Michael (michka) Kaplan" wrote:
> > Sandro,
> >
> > Are you basically wanting the ordering to be different?
> >
> > Unicode does not have any expressed or implied warranty that the ordering of
> > characters will be anything like what a user would expect (how can it, when
> > even so many languages that use the same scripts have entirely different,
> > occasionally conflicting, collation rules?).
> >
> > It is up to the software to make the necessary collation rules happen.
> >
> > For example, in Windows 2000 there are two different sorts supported for
> > Georgian: "modern" and "traditional." The difference is that modern has four
> > letters (He, Hie, We, and Har, both Capital and Small) sort at the end of
> > the alphabet (which I presume corresponds to the sort that you do not
> > like?), while the traditional sort has:
> >
> > * He appearing between Zen and Tan
> > * Hie appearing between Nar and On
> > * We appearing between Un and Phar
> > * Har appearing between Xan and Jhan
> >
> > I presume the above "exceptions" more closely match the sort you would
> > expect? And if there are more, this would be very valuable information (as
> > the rule behind all new "sorts" like this is that a valid need to sort
> > text differently was identified).
> >
> > As a rule, Unicode order is not intended to follow, nor does it explicitly
> > decide to follow, any kind of collation rules for code point order.
> >
> > FWIW, the LCIDs behind these two sorts under Windows 2000 (used in the C
> > CompareString and the VB StrComp) are:
> >
> > Traditional: 1079 (0x0437)
> > Modern: 66615 (0x10437)
> >
> > michka

thanks is there any way to change already defined character codes?

2000-08-08 Thread Sandro Karumidze

thank you for information.

I completely agree with you that codes should not be changed. I just wanted to
know more about rules.

best regards,

Sandro Karumidze

"Michael (michka) Kaplan" wrote:

> Sandro,
>
> Are you basically wanting the ordering to be different?
>
> Unicode does not have any expressed or implied warranty that the ordering of
> characters will be anything like what a user would expect (how can it, when
> even so many languages that use the same scripts have entirely different,
> occasionally conflicting, collation rules?).
>
> It is up to the software to make the necessary collation rules happen.
>
> For example, in Windows 2000 there are two different sorts supported for
> Georgian: "modern" and "traditional." The difference is that modern has four
> letters (He, Hie, We, and Har, both Capital and Small) sort at the end of
> the alphabet (which I presume corresponds to the sort that you do not
> like?), while the traditional sort has:
>
> * He appearing between Zen and Tan
> * Hie appearing between Nar and On
> * We appearing between Un and Phar
> * Har appearing between Xan and Jhan
>
> I presume the above "exceptions" more closely match the sort you would
> expect? And if there are more, this would be very valuable information (as
> the rule behind all new "sorts" like this is that a valid need to sort
> text differently was identified).
>
> As a rule, Unicode order is not intended to follow, nor does it explicitly
> decide to follow, any kind of collation rules for code point order.
>
> FWIW, the LCIDs behind these two sorts under Windows 2000 (used in the C
> CompareString and the VB StrComp) are:
>
> Traditional: 1079 (0x0437)
> Modern: 66615 (0x10437)
>
> michka
>
> - Original Message -
> From: "Sandro Karumidze" <[EMAIL PROTECTED]>
> To: "Unicode List" <[EMAIL PROTECTED]>
> Cc: "Unicode List" <[EMAIL PROTECTED]>
> Sent: Tuesday, August 08, 2000 3:26 AM
> Subject: Re: is there any way to change already defined character codes?
> > Dear Chris,
> >
> > Thank you for your answer.
> >
> > > May I ask what is the reason these people from the government of Georgia want
> > > to change the codepoints of some Georgian characters? There is probably another
> > > good solution (or solutions) for whatever problem they think would be solved by
> > > changing encoding points.
> >
> > The issue is that in Unicode there is a sequence of Georgian characters different
> > from what these people think it should be.
> >
> > In modern Georgian there are 33 widely used characters. However before there were
> > 38 characters. In the beginning of this century 5 characters were dropped, though still
> > used in old texts and by language specialists.
> >
> > In Unicode these 5 characters follow the 33. There is a different point of view that
> > those 5 should be included among the others.
> >
> > This is all the issue - there are no specific implementation difficulties or
> > problems. The only point is that 5 among the rest 33 is more "correct".
> >
> > Best regards,
> >
> > Sandro Karumidze
> >
> > > Regards
> > >
> > > - Chris
> > >
> > > "Sandro Karumidze" <[EMAIL PROTECTED]> wrote:
> > >
> > > > There are people from the government of Georgia interested in the possibility of
> > > > altering the Unicode standard in terms of changing codes for some of the Georgian
> > > > characters.
> > > >
> > > > Does this type of thing happen in the Consortium and, if yes, under what
> > > > circumstances?
> > > >
> > > > If not, can you specify in which rules it is defined that these types of
> > > > changes are not allowed?
> > > >
> > > > Thanks in advance for your support,
> > > >
> > > > Best regards,
> > > >
> > > > Sandro Karumidze

Re: Unicode String literals on various platforms

2000-08-08 Thread Antoine Leca

Bob Jones wrote:
> In a C program, how do you code Unicode string literals on the following
> platforms:
> NT
> Unix (Sun, AIX, HP-UX)
> AS/400

We devised a solution for this problem in the C99 Standard.
The "solution" is named "UCN", for Universal Character Name, and
is essentially to use the \u notation (borrowed from Java), like
(with Ken's example)

  char C_thai[] = "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

And similarly
  wchar_t C_thai[] = L"\u0E40...
or
  TCHAR C_thai[] = _T("\u0E40...
depending on your storage option. See below for more.

The benefit is that your C program is now portable to any platform
where the C compiler complies with C99.
The drawback is that, at present, there are very few such compilers.

> Everything I have read says not to use wchar_t for cross platform apps
> because the size is not uniform, i.e. NT it is an unsigned short (2 bytes)
> while on Unix it is an unsigned int (4 bytes).  If you create your own TCHAR
> or whatever, how do you handle string literals? 

A similar problem exists with numbers, doesn't it? And the usual solution
is to *not* exchange data in internal format, but rather to use textual
representations. Agreed?

For a C _program_, where the textual representations are string literals (rather
than arrays of integers), C99 UCN is the way to go.

Now, since you are talking of wchar_t vs. other forms of storing characters,
I wonder whether you are really asking about the _data_ being manipulated,
as opposed to the C program.

Then, I believe the solution is exactly the same as with numbers: internally
use whatever is the most appropriate to the current platform (the TCHAR/_T()
solution of Microsoft is nice because it conveniently switches between
char and wchar_t depending on compilation options), but when exchanging data,
change to a common, textual representation.

Look at the %lc and %ls conversions of [w]printf/[w]scanf to learn how to
output/input wide characters to/from text files. Another solution is to use
"Unicode" files, using some dedicated conversions, pretty much the same as using
the htons(), ntohl(), etc. functions when dealing with low-level Internet protocols.
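
The idea of converting to an agreed external form at the boundary, as one does with htons() for integers, can be sketched in a few lines. Python is used here only because its codecs make the byte order explicit; the variable names are illustrative, and this is not a description of any C library facility.

```python
text = "\u0E40\u0E02\u0E17"  # first three characters of the Thai example above

# Serialize with an explicitly named encoding and byte order, instead of
# dumping whatever the in-memory representation happens to be.
as_utf8 = text.encode("utf-8")
as_utf16be = text.encode("utf-16-be")   # big-endian, like htons() for each unit
as_utf16le = text.encode("utf-16-le")

print(as_utf8.hex(), as_utf16be.hex(), as_utf16le.hex())

# The receiver decodes using the same agreed name and gets the same text back.
assert as_utf8.decode("utf-8") == text
assert as_utf16be.decode("utf-16-be") == text
```

Whichever encoding is picked, the important part is that both sides name it explicitly, so the width of wchar_t on either platform never reaches the wire or the file.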

I agree that the C Standard currently lacks a way to indicate that one
would open a text file using a specific encoding (e.g. UTF-16LE/BE,
or UTF-8). And the discussions on this matter have been endless so far.

> On NT L"foobar" gives each character 2 bytes,


> but on Unix L"foobar" uses 4 bytes per character.

Depends on the compiler. Some are 4 bytes, some are 8 (64-bit boxes), some
are even only 8-bit (and are not Unicode compliant).
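
One quick way to see what a given platform does is to ask for the size of wchar_t directly. The sketch below does this from Python through the standard `ctypes` module, on the assumption that the Python build was compiled with the platform's usual C compiler settings; it reports the width rather than assuming 2 or 4 bytes.

```python
import ctypes

# ctypes.c_wchar mirrors the wchar_t of the C compiler Python was built with.
width = ctypes.sizeof(ctypes.c_wchar)
print(f"wchar_t is {width} bytes ({width * 8} bits) on this platform")

# Six characters plus the terminating null, as in L"foobar".
buf = ctypes.create_unicode_buffer("foobar")
print(f'L"foobar" occupies {ctypes.sizeof(buf)} bytes here')
```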

> Even worse I suspect is the AS/400 where the string literal is probably in

Perhaps (and even probably, as L'a' is required to be equal to 'a' in C),
but what is the problem? You are not going to memcpy() L"foobar", or
fwrite() it, are you? And I am sure your AS/400 implementation has
some way to specify on open() that a text file is really an "ASCII", rather
than EBCDIC, file. Or if it does not, it should...


Re: is there any way to change already defined character codes?

2000-08-08 Thread John Cowan

On Tue, 8 Aug 2000, Sandro Karumidze wrote:

> The issue is that in Unicode there is a sequence of Georgian characters different
> from what these people think it should be.
> In modern Georgian there are 33 widely used characters. However before there were
> 38 characters. In the beginning of this century 5 characters were dropped, though still
> used in old texts and by language specialists.
> In Unicode these 5 characters follow the 33. There is a different point of view that
> those 5 should be included among the others.
> This is all the issue - there are no specific implementation difficulties or
> problems. The only point is that 5 among the rest 33 is more "correct".

Ah, OK.  The order of characters in the Unicode Standard is *not*
meant to be the proper sort order for any language (even English)
or relied on for that purpose.  If any changes are needed, it is to
the Unicode default collating sequence (which I have not checked) and not to
the codes for the characters themselves.
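
A small sketch of the "even English" point, in Python: plain sorted() compares code points, so the result is not alphabetical once upper and lower case are mixed. The casefold key below is only a stand-in to show where ordering rules belong (in a sort key, not in the codes); it is not the Unicode collation algorithm.

```python
words = ["apple", "Zebra", "banana"]

# Plain sorted() compares code points, so every uppercase letter
# (U+0041..U+005A) sorts before every lowercase one (U+0061..U+007A).
by_code_point = sorted(words)

# Ordering rules go into a sort key instead of into the character codes.
# str.casefold is a crude stand-in here, not a real collation algorithm.
by_key = sorted(words, key=str.casefold)

print(by_code_point)
print(by_key)
```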

C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant
le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux,
de rapport nyait pas.   -- Jacques Lacan, "L'Etourdit"

Re: is there any way to change already defined character codes?

2000-08-08 Thread John Cowan

On Mon, 7 Aug 2000, Jianping Yang wrote:

> Not really true for Unicode, in which some code points for Hangul were
> relocated between Unicode 1.1 and 2.0 :)


C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant
le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux,
de rapport nyait pas.   -- Jacques Lacan, "L'Etourdit"

RE: is there any way to change already defined character codes?

2000-08-08 Thread Marco . Cimarosti

Sandro Karumidze wrote:
> The issue is that in Unicode the sequence of Georgian
> characters is different
> from what these people think it should be.
> [...] At the beginning of this century 5 characters were dropped
> [...]
> In Unicode these 5 characters follow the 33. There is a different
> point of view that those 5 should be included among the
> others.

(You definitely need an official reply, but let's go on with some more
informal chatting.)

I foresee that this would not be considered a good reason to change

The order of characters in Unicode (or in any other character encoding) is
not important. The purpose of a character set is to assign a unique number to
each character, not to define an "alphabetical order".

If you notice, the situation that you describe is true for *all* the
alphabets in Unicode.

E.g., if you look at the Latin part, you see that the 26 letters used in
modern English are all contiguously ordered in two areas: U+0041 to U+005A
(uppercase) and U+0061 to U+007A (lowercase).

But that's the end of the story! All the other hundreds of Latin letters are
scattered all over, using no consistent order.

The same is true for Cyrillic, Greek, Hebrew, Arabic, and so on. Have a look
at those blocks: the basic letters for post-czar Russian, modern Greek,
Israeli Hebrew, modern Arabic etc. are consistently ordered, but the letters
for other languages that use the same alphabets (or ancient letters for the
same languages) are scattered all over with no specific order.

The reason why no one cares about the order of characters is that it is
*impossible* to determine a "correct" order.

In alphabets used by more than one language (e.g. Latin, Cyrillic, Arabic,
Devanagari, etc.), the alphabetic order is normally different for each
language.

Moreover, many languages have more than one alphabetic order, all equally
valid and in current usage.

For this reason the problem of "alphabetic order" has been pulled apart from
character sets, and addressed separately.

In Unicode, the issue of "collation" is handled by an ad hoc, optional
algorithm that is part of the standard but is separate from the encoding
issue itself.

The algorithm is titled "Unicode Technical Report #10: Unicode Collation
Algorithm", and you can find it here: .

*That* is the place to check whether Georgian Letters are in the correct
order or not. And if they are not, you have two options:

1) Ask Unicode to change it: here you *do* have some chance of being listened
to, if you have valid arguments.

2) Change it yourself: unlike the character values, the collation algorithm
is designed to be flexible and customizable.

_ Marco

Re: is there any way to change already defined character codes?

2000-08-08 Thread Michael \(michka\) Kaplan


Are you basically wanting the ordering to be different?

Unicode does not have any expressed or implied warranty that the ordering of
characters will be anything like what a user would expect (how can it, when
even so many languages that use the same scripts have entirely different,
occasionally conflicting, collation rules?).

It is up to the software to make the necessary collation rules happen.

For example, in Windows 2000 there are two different sorts supported for
Georgian: "modern" and "traditional." The difference is that modern has four
letters (He, Hie, We, and Har, both Capital and Small) sort at the end of
the alphabet (which I presume corresponds to the sort that you do not
like?), while the traditional sort has:

* He appearing between Zen and Tan
* Hie appearing between Nar and On
* We appearing between Un and Phar
* Har appearing between Xan and Jhan

I presume the above "exceptions" more closely match the sort you would
expect? And if there are more, this would be very valuable information (as
the rule behind all new "sorts" like this is that a valid need to sort
text differently was identified).
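
To make the four placements listed above concrete, here is a minimal Python sketch of a custom sort key. It is not how Windows 2000 implements its Georgian sorts; it only assumes that the letters named above are the Mkhedruli characters U+10D0 through U+10F4, looked up by their Unicode character names, and the helper names (letter, traditional_key) are made up for the example.

```python
import unicodedata

def letter(name):
    # Look up a Mkhedruli letter by its Unicode character name.
    return unicodedata.lookup("GEORGIAN LETTER " + name)

# Code point order for U+10D0 (AN) through U+10F4 (HAR).
code_point_order = [chr(cp) for cp in range(0x10D0, 0x10F5)]

# The four placements listed above: letter -> letter it should follow.
placements = {"HE": "ZEN", "HIE": "NAR", "WE": "UN", "HAR": "XAN"}

moved = {letter(name) for name in placements}
traditional = [c for c in code_point_order if c not in moved]
for name, after in placements.items():
    traditional.insert(traditional.index(letter(after)) + 1, letter(name))

rank = {c: i for i, c in enumerate(traditional)}

def traditional_key(word):
    # Characters outside the table sort after it, in code point order.
    return [rank.get(ch, len(rank) + ord(ch)) for ch in word]

sample = [letter("TAN"), letter("HE"), letter("ZEN")]
print([unicodedata.name(c) for c in sorted(sample)])
print([unicodedata.name(c) for c in sorted(sample, key=traditional_key)])
```

By code point, HE sorts after both ZEN and TAN; with the key, it lands between them, and no character code had to change.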

As a rule, Unicode code point order is not intended to follow any kind of
collation rules, nor does it explicitly try to.

FWIW, the LCIDs behind these two sorts under Windows 2000 (used in the C
CompareString and the VB StrComp) are:

Traditional: 1079 (0x0437)
Modern: 66615 (0x10437)
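
The two values differ only in one high bit. A short Python sketch of the arithmetic, assuming the usual Windows LCID layout (language identifier in the low 16 bits, sort identifier in the next 4 bits); split_lcid is a hypothetical helper, not a Windows API:

```python
def split_lcid(lcid):
    """Split an LCID into (sort ID, language ID), assuming the usual layout."""
    sort_id = (lcid >> 16) & 0xF   # bits 16-19
    lang_id = lcid & 0xFFFF        # bits 0-15
    return sort_id, lang_id

for label, lcid in [("Traditional", 1079), ("Modern", 66615)]:
    sort_id, lang_id = split_lcid(lcid)
    print(label, hex(lcid), "-> sort ID", sort_id, "language ID", hex(lang_id))
```

Both LCIDs share language ID 0x437; only the sort ID differs (0 versus 1).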


- Original Message -
From: "Sandro Karumidze" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: "Unicode List" <[EMAIL PROTECTED]>
Sent: Tuesday, August 08, 2000 3:26 AM
Subject: Re: is there any way to change already defined character codes?

> Dear Chris,
> Thank you for your answer.
> > May I ask what is the reason these people from the government of Georgia want
> > to change the codepoints of some Georgian characters? There is probably another
> > good solution (or solutions) for whatever problem they think would be solved by
> > changing encoding points.
> The issue is that in Unicode the sequence of Georgian characters is different
> from what these people think it should be.
> In modern Georgian there are 33 widely used characters. However, earlier there were
> 38 characters. At the beginning of this century 5 characters were dropped, though
> they are still used in old texts and by language specialists.
> In Unicode these 5 characters follow the 33. There is a different point of view that
> those 5 should be included among the others.
> This is the whole issue - there are no specific implementation difficulties or
> problems. The only point is that having the 5 among the other 33 is more "correct".
> Best regards,
> Sandro Karumidze
> >
> > Regards
> >
> > - Chris
> >
> > "Sandro Karumidze" <[EMAIL PROTECTED]> wrote:
> >
> > > There are people from the government of Georgia interested in the possibility of
> > > altering the Unicode standard in terms of changing the codes for some Georgian
> > > characters.
> >
> > > Does this type of thing happen in the Consortium, and if yes, under what
> > > circumstances?
> >
> > > If not, can you specify in which rules it is defined that these types of
> > > changes are not allowed?
> >
> > > Thanks in advance for your support,
> >
> > > Best regards,
> >
> > > Sandro Karumidze

Re: is there any way to change already defined character codes?

2000-08-08 Thread Sandro Karumidze

Dear Chris,

Thank you for your answer.

> May I ask what is the reason these people from the government of Georgia want
> to change the codepoints of some Georgian characters? There is probably another
> good solution (or solutions) for whatever problem they think would be solved by
> changing encoding points.

The issue is that in Unicode the sequence of Georgian characters is different
from what these people think it should be.

In modern Georgian there are 33 widely used characters. However, earlier there were
38 characters. At the beginning of this century 5 characters were dropped, though
they are still used in old texts and by language specialists.

In Unicode these 5 characters follow the 33. There is a different point of view that
those 5 should be included among the others.

This is the whole issue - there are no specific implementation difficulties or
problems. The only point is that having the 5 among the other 33 is more "correct".

Best regards,

Sandro Karumidze

> Regards
> - Chris
> "Sandro Karumidze" <[EMAIL PROTECTED]> wrote:
> > There are people from the government of Georgia interested in the possibility of
> > altering the Unicode standard in terms of changing the codes for some Georgian
> > characters.
> > Does this type of thing happen in the Consortium, and if yes, under what
> > circumstances?
> > If not, can you specify in which rules it is defined that these types of
> > changes are not allowed?
> > Thanks in advance for your support,
> > Best regards,
> > Sandro Karumidze

(no subject)

2000-08-08 Thread C. Janardhana Gupta

I have an application that doesn't include Unicode support at all.
Considering this, can I use the Uniscribe APIs in my application? The system on
which I want to run my application is Windows 98.

Specifically, is there any relationship between the Uniscribe APIs and Unicode,
and if yes, what exactly is it?


C.Janardhana Guptha
Quark, Chandigarh

Re: FW: Unicode - Exponent and indication sign

2000-08-08 Thread 11digitboy

Yes. Try the middle of the "20__" range of characters.

Robert Lozyniak
Accusplit pedometer manufactures can go suck eggs
My page:
(917) 421-3909 x1133 - voicemail/fax

 "Magda Danish (Unicode)" <[EMAIL PROTECTED]>
> -Original Message-
> From: Marchand, Gilles [mailto:[EMAIL PROTECTED]]
> Sent: Monday, August 07, 2000 6:33 AM
> Subject: Unicode - Exponent and indication sign
> Hello,
> we plan to use ISO Latin 8859-1
> as our default character
> set. A question from a user was: does it support
> exponentiation N2, or
> the indication sign O4? If so, where can I find
> instructions on how to use it?
> Thank you for listening to me.
> Gilles Marchand 
> UQAM - Library system 


Re: is there any way to change already defined character codes?

2000-08-08 Thread Jianping Yang

Not really true for Unicode, in which some code points for Hangul were
relocated between Unicode 1.1 and 2.0 :)


"Christopher J. Fynn" wrote:

> Sandro
> I'm sure someone official will give you an official answer, but I know the only
> answer you are going to get to your question is NO - there is no way to change
> the encoding point of a character (or to change a character name) once it is in
> the Unicode or ISO 10646 standards. Allowing changes like this would break
> existing implementations of these standards - and of course these standards
> would be useless as standards if they were subject to that kind of change.
> Proposals to encode new characters in the Unicode and ISO 10646 standards have
> to go through a lengthy process of consideration and there is ample opportunity
> to submit comments on any proposal during that process. However once characters
> are finally assigned code points in the Unicode and ISO 10646 standards that's
> it.
> May I ask what is the reason these people from the government of Georgia want
> to change the codepoints of some Georgian characters? There is probably another
> good solution (or solutions) for whatever problem they think would be solved by
> changing encoding points.
> Regards
> - Chris
> "Sandro Karumidze" <[EMAIL PROTECTED]> wrote:
> > There are people from the government of Georgia interested in the possibility of
> > altering the Unicode standard in terms of changing the codes for some Georgian
> > characters.
> > Does this type of thing happen in the Consortium, and if yes, under what
> > circumstances?
> > If not, can you specify in which rules it is defined that these types of
> > changes are not allowed?
> > Thanks in advance for your support,
> > Best regards,
> > Sandro Karumidze