Re: is there any way to change already defined character codes?

2000-08-08 Thread Jianping Yang

Not quite, in Unicode's case: some code points for Hangul were relocated
between Unicode 1.1 and 2.0 :)

Regards,
Jianping.

"Christopher J. Fynn" wrote:

 Sandro

 I'm sure someone official will give you an official answer, but I know the only
 answer you are going to get to your question is NO - there is no way to change
 the encoding point of a character (or to change a character name) once it is in
 the Unicode or ISO 10646 standards. Allowing changes like this would break
 existing implementations of these standards - and of course these standards
 would be useless as standards if they were subject to that kind of change.

 Proposals to encode new characters in the Unicode and ISO 10646 standards have
 to go through a lengthy process of consideration and there is ample opportunity
 to submit comments on any proposal during that process. However once characters
 are finally assigned code points in the Unicode and ISO 10646 standards that's
 it.

 May I ask what is the reason these people from the government of Georgia want
 to change the codepoints of some Georgian characters? There is probably another
 good solution (or solutions) for whatever problem they think would be solved by
 changing encoding points.

 Regards

 - Chris

 "Sandro Karumidze" [EMAIL PROTECTED] wrote:

  There are people from the government of Georgia interested in the
  possibility of altering the Unicode standard in terms of changing the codes
  for some Georgian characters.

  Does this type of thing happen in the Consortium, and if so, under what
  circumstances?

  If not, can you specify which rules state that these types of changes are
  not allowed?

  Thanks in advance for your support,

  Best regards,

  Sandro Karumidze




(no subject)

2000-08-08 Thread C. Janardhana Gupta



I have an application that doesn't include Unicode support at all.
Given this, can I use the Uniscribe APIs in my application? The system on
which I want to run my application is Windows 98.

Specifically, is there any relationship between the Uniscribe APIs and Unicode,
and if so, what exactly is it?

Thanks

C.Janardhana Guptha
Quark, Chandigarh




Re: is there any way to change already defined character codes?

2000-08-08 Thread Michael (michka) Kaplan

Sandro,

Are you basically wanting the ordering to be different?

Unicode does not have any expressed or implied warranty that the ordering of
characters will be anything like what a user would expect (how can it, when
so many languages that use the same script have entirely different,
occasionally conflicting, collation rules?).

It is up to the software to make the necessary collation rules happen.

For example, in Windows 2000 there are two different sorts supported for
Georgian: "modern" and "traditional." The difference is that modern has four
letters (He, Hie, We, and Har, both Capital and Small) sort at the end of
the alphabet (which I presume corresponds to the sort that you do not
like?), while the traditional sort has:

* He appearing between Zen and Tan
* Hie appearing between Nar and On
* We appearing between Un and Phar
* Har appearing between Xan and Jhan

I presume the above "exceptions" more closely match the sort you would
expect? And if there are more, this would be very valuable information (as
the rule behind all new "sorts" like this is that a valid need to sort
text differently was identified).

As a rule, Unicode code point order is not intended to follow, and makes no
attempt to follow, any particular set of collation rules.

FWIW, the LCIDs behind these two sorts under Windows 2000 (as used with the
Win32 CompareString function from C and with StrComp in VB) are:

Traditional: 1079 (0x0437)
Modern: 66615 (0x10437)
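
A minimal sketch of how software, rather than the code chart, could supply
the "traditional" order listed above. It assumes the Mkhedruli block
(U+10D0 to U+10F4) only and does not cover the capital forms; the function name
and the weighting scheme are hypothetical and are not how Windows implements
these sorts.

```c
/* Mkhedruli code points; names follow the Unicode character names. */
enum {
    GEO_AN  = 0x10D0,  /* first of the 33 modern letters */
    GEO_ZEN = 0x10D6,
    GEO_NAR = 0x10DC,
    GEO_UN  = 0x10E3,
    GEO_XAN = 0x10EE,
    GEO_HAE = 0x10F0,  /* last of the 33 modern letters */
    GEO_HE  = 0x10F1,
    GEO_HIE = 0x10F2,
    GEO_WE  = 0x10F3,
    GEO_HAR = 0x10F4
};

/* Hypothetical helper: sort weight of one letter under the "traditional"
   order. Modern letters get even weights in code point order; each archaic
   letter gets the odd weight right after its predecessor. Returns -1 for
   code points this sketch does not handle. */
int georgian_traditional_weight(unsigned long cp)
{
    if (cp >= GEO_AN && cp <= GEO_HAE)
        return (int)(cp - GEO_AN) * 2;
    switch (cp) {
    case GEO_HE:  return (GEO_ZEN - GEO_AN) * 2 + 1;  /* Zen < He  < Tan  */
    case GEO_HIE: return (GEO_NAR - GEO_AN) * 2 + 1;  /* Nar < Hie < On   */
    case GEO_WE:  return (GEO_UN  - GEO_AN) * 2 + 1;  /* Un  < We  < Phar */
    case GEO_HAR: return (GEO_XAN - GEO_AN) * 2 + 1;  /* Xan < Har < Jhan */
    default:      return -1;
    }
}
```

Comparing two letters by this weight instead of by code point gives the
traditional order while leaving the encoded characters where they are.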

michka


- Original Message -
From: "Sandro Karumidze" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]
Sent: Tuesday, August 08, 2000 3:26 AM
Subject: Re: is there any way to change already defined character codes?


 Dear Chris,

 Thank you for your answer.

  May I ask what is the reason these people from the government of Georgia
  want to change the codepoints of some Georgian characters? There is probably
  another good solution (or solutions) for whatever problem they think would
  be solved by changing encoding points.

 The issue is that the sequence of Georgian characters in Unicode is
 different from what these people think it should be.

 Modern Georgian has 33 widely used characters. Previously, however, there
 were 38. At the beginning of this century 5 characters were dropped, though
 they are still used in old texts and by language specialists.

 In Unicode these 5 characters follow the other 33. There is a different point
 of view that those 5 should be interspersed among the others.

 That is the whole issue: there are no specific implementation difficulties
 or problems. The only point is that placing the 5 among the other 33 is
 more "correct".

 Best regards,

 Sandro Karumidze





 






Re: is there any way to change already defined character codes?

2000-08-08 Thread John Cowan

On Mon, 7 Aug 2000, Jianping Yang wrote:

 Not quite, in Unicode's case: some code points for Hangul were relocated
 between Unicode 1.1 and 2.0 :)

Yes, but NEVER AGAIN.

-- 
John Cowan   [EMAIL PROTECTED]
C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant
le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux,
de rapport nyait pas.   -- Jacques Lacan, "L'Etourdit"





Re: is there any way to change already defined character codes?

2000-08-08 Thread John Cowan

On Tue, 8 Aug 2000, Sandro Karumidze wrote:

 The issue is that the sequence of Georgian characters in Unicode is
 different from what these people think it should be.
 
 Modern Georgian has 33 widely used characters. Previously, however, there
 were 38. At the beginning of this century 5 characters were dropped, though
 they are still used in old texts and by language specialists.
 
 In Unicode these 5 characters follow the other 33. There is a different point
 of view that those 5 should be interspersed among the others.
 
 That is the whole issue: there are no specific implementation difficulties
 or problems. The only point is that placing the 5 among the other 33 is
 more "correct".

Ah, OK.  The order of characters in the Unicode Standard is *not*
meant to be the proper sort order for any language (even English),
nor should it be relied on for that purpose.  If any changes are needed, they
should be made to the Unicode default collating sequence (which I have not
checked) and not to the codes for the characters themselves.
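
The "even English" point can be seen with plain ASCII. The sketch below
assumes an ASCII-based execution character set; compare_ignoring_case() is a
hypothetical and deliberately naive stand-in for a real collation algorithm.

```c
#include <ctype.h>
#include <string.h>

/* Code point order: for ASCII text this is what strcmp() gives. */
int compare_codepoints(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Hypothetical, naive "dictionary" comparison for ASCII letters only:
   compare the two strings without regard to case. */
int compare_ignoring_case(const char *a, const char *b)
{
    for (;; a++, b++) {
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == '\0')
            return 0;
    }
}
```

In code point order "Zebra" sorts before "apple", because every uppercase
letter precedes every lowercase one; the case-insensitive comparison puts
"apple" first, as an English reader would expect.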

-- 
John Cowan   [EMAIL PROTECTED]
C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant
le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux,
de rapport nyait pas.   -- Jacques Lacan, "L'Etourdit"





Re: Unicode String literals on various platforms

2000-08-08 Thread Antoine Leca

Bob Jones wrote:
 
 In a C program, how do you code Unicode string literals on the following
 platforms:
 NT
 Unix (Sun, AIX, HP-UX)
 AS/400

We devised a solution for this problem in the C99 Standard.
The "solution" is named "UCN", for universal character name, and
is essentially to use the \u notation (borrowed from Java), like
(with Ken's example)

  char C_thai[] = "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

And similarly
  wchar_t C_thai[] = L"\u0E40...
or
  TCHAR C_thai[] = _T("\u0E40...
depending on your storage option. See below for more.

The benefit is that your C program is now portable to any platform
where the C compiler complies with C99.
The drawback is that, at present, there are very few such compilers.
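
A small sketch of the two literal forms, assuming a compiler that accepts C99
universal character names. The helper names are hypothetical. The byte count
of the narrow form depends on the execution character set (27 if that is
UTF-8), so only the wide form has a predictable length here.

```c
#include <string.h>
#include <wchar.h>

/* The Thai example, stored as narrow and as wide string literals. */
static const char thai_narrow[] =
    "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
static const wchar_t thai_wide[] =
    L"\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

/* Bytes in the narrow literal: implementation-defined, since each UCN is
   converted to the execution character set at compile time. */
size_t thai_narrow_bytes(void)
{
    return strlen(thai_narrow);
}

/* Wide characters in the wide literal: one per UCN, provided wchar_t can
   hold each of these characters in a single element. */
size_t thai_wide_chars(void)
{
    return wcslen(thai_wide);
}
```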

 
 Everything I have read says not to use wchar_t for cross platform apps
 because the size is not uniform, i.e. NT it is an unsigned short (2 bytes)
 while on Unix it is an unsigned int (4 bytes).  If you create your own TCHAR
 or whatever, how do you handle string literals? 

A similar problem exists with numbers, doesn't it? And the usual solution
is to *not* exchange data in internal format, but rather to use textual
representations. Agreed?

For a C _program_, where the textual representation is a string literal (rather
than an array of integers), C99 UCN is the way to go.

Now, since you are talking of wchar_t vs. other forms of storing characters,
I wonder if you are not asking about the problem of the manipulated _data_,
as opposed to the C program.

Then, I believe the solution is exactly the same as with numbers: internally
use whatever is the most appropriate to the current platform (the TCHAR/_T()
solution of Microsoft is nice because it conveniently switches to either
char or wchar_t depending on compilation options), but when exchanging data,
change to a common, textual representation.

Look at the %lc and %ls conversions of [w]printf/[w]scanf to learn how to
output/input wide characters to/from text files. Another solution is to use
"Unicode" files, with some dedicated conversions, in much the same way as one
uses the htons(), ntohl(), etc. functions when dealing with low-level Internet
protocols.
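
One possible shape for such a dedicated conversion, sketched under the
assumption that the exchange format is UTF-16BE. The function name
put_utf16be() is hypothetical. Like htons(), it fixes the byte order
explicitly, so the output does not depend on sizeof(wchar_t) or on the host.

```c
#include <stddef.h>

/* Hypothetical helper: write one Unicode code point to out[] as UTF-16BE,
   byte by byte. Returns the number of bytes written (2 or 4), or 0 if cp is
   a surrogate code point or lies beyond U+10FFFF. out must have room for
   4 bytes. */
size_t put_utf16be(unsigned long cp, unsigned char *out)
{
    unsigned long hi, lo;

    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return 0;
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(cp >> 8);
        out[1] = (unsigned char)(cp & 0xFFUL);
        return 2;
    }
    cp -= 0x10000UL;
    hi = 0xD800UL + (cp >> 10);      /* high surrogate */
    lo = 0xDC00UL + (cp & 0x3FFUL);  /* low surrogate  */
    out[0] = (unsigned char)(hi >> 8);
    out[1] = (unsigned char)(hi & 0xFFUL);
    out[2] = (unsigned char)(lo >> 8);
    out[3] = (unsigned char)(lo & 0xFFUL);
    return 4;
}
```

The bytes can then be passed to fwrite() on a stream opened in binary mode;
the reading side performs the inverse conversion into whatever internal form
suits its platform.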


I agree the C Standard currently lacks a way to indicate that one wants to
open a text file using a specific encoding (e.g. UTF-16LE/BE or UTF-8).
And the discussion on this matter has been endless so far.


 On NT L"foobar" gives each character 2 bytes,

Yes

 but on Unix L"foobar" uses 4 bytes per character.

Depends on the compiler. Some are 4 bytes, some are 8 (64-bit boxes), some
are even only 8-bit (and are not Unicode compliant).

 Even worse I suspect is the AS/400 where the string literal is probably in
 EBCDIC.

Perhaps (and even probably, as L'a' is required to be equal to 'a' in C),
but what is the problem? You are not going to memcpy() L"foobar", or
to fwrite() it, are you? And I am sure your AS/400 implementation has
some way to specify on open() that a text file is really an "ASCII", rather
than EBCDIC, file. Or if it does not, it should...



Regards,
Antoine



RE: is there any way to change already defined character codes?

2000-08-08 Thread Peter_Constable


On 08/08/2000 06:40:17 AM Marco.Cimarosti wrote:

(You definitely need an official reply, but let's go on with some more
informal chatting.)

All the "officials" are busy meeting this week, but the statement, "Can't
be done" is just as true whether it comes from the lips (or... fingertips)
of a Ken Whistler or Mark Davis as from a Marco Cimarosti or a Chris Fynn.
There are enough of us on this list that have a solid understanding of the
standard and its development that a question like this can be answered
without waiting for an "official" answer (though this question really ought
to be answered somewhere on the Unicode web site); if somebody were to give
wrong information, there would be several that wouldn't hesitate to
correct.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





RE: Summary: xml:lang validity and RFC 1766 refs to outdated code

2000-08-08 Thread Mike Brown

  XML 1.0 says that xml:lang attributes must match production 33
 
 In fact, not so. Productions 33-38 have no normative value
 whatsoever, as there is neither a production nor normative
 language connecting them with the rest of XML 1.0.
 [...]
 In recognition of this fact, official erratum E73 (at
 http://www.w3.org/XML/xml-19980210-errata#E73) removes these 
 productions from XML 1.0 altogether. It also allows for a
 successor to RFC 1766 when and if such a thing exists.

Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO 639 and
ISO 3166, at least not by a strict interpretation of its formal language.
And to date, there still is no successor to RFC 1766.

E73 says in its rationale "The XML processor does not deal with the value of
xml:lang", but it also says, more formally, "The values of the attribute are
language identifiers as defined by [IETF RFC 1766]".

The use of "are" in that statement sounds as definitive as "must" to me. As
an XML document author, or the programmer of an XML document authoring tool,
tell me, do I or do I not use RFC 1766 language tags/identifiers as xml:lang
values? It seems that XML says I must use them, but it would not be a
violation of validity if I didn't use them.

I also don't see how one could read RFC 1766 in such a way as to ignore its
prescription of a finite range of possible values for what it calls a
language tag:

  Language-Tag = Primary-tag *( "-" Subtag )
  Primary-tag = 1*8ALPHA
  Subtag = 1*8ALPHA

In the primary language tag:

 -All 2-letter tags are interpreted according to ISO
  standard 639, "Code for the representation of names
  of languages" [ISO 639].

 [...mention of "i-" and "x-"...]

 -Other values cannot be assigned except by updating
  this standard.

...so the removal of productions 33-38 from XML really just seems to be
intended to allow RFC 1766 and its successors to determine the proper
construction of a language tag, which makes more sense than trying to
reiterate the RFC's technical contents in XML's specification. It doesn't
necessarily follow that xml:lang values can avoid conforming to RFC 1766.
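
The grammar quoted above is small enough to check mechanically. The sketch
below tests the syntax only; the function name is hypothetical. A tag such as
"roa" passes this test even though the quoted RFC text assigns meaning only to
2-letter ISO 639 codes and the "i-" and "x-" forms, so a syntactic check alone
does not settle whether a value conforms.

```c
#include <string.h>

/* ALPHA in the quoted grammar: the 52 unaccented Latin letters. */
static int is_alpha(char c)
{
    static const char letters[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    return c != '\0' && strchr(letters, c) != NULL;
}

/* Hypothetical helper: returns 1 if tag matches
       Language-Tag = Primary-tag *( "-" Subtag )
   where each part is 1*8ALPHA, else 0. */
int is_rfc1766_syntax(const char *tag)
{
    size_t run = 0;  /* letters seen in the current part */

    for (; *tag != '\0'; tag++) {
        if (is_alpha(*tag)) {
            if (++run > 8)
                return 0;  /* part longer than 8 letters */
        } else if (*tag == '-') {
            if (run == 0)
                return 0;  /* empty part before the hyphen */
            run = 0;
        } else {
            return 0;      /* character outside the grammar */
        }
    }
    return run > 0;        /* the last part must not be empty */
}
```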

[We're on the same side, here. I'm just playing devil's advocate, because
after I heard about this issue and reviewed the specs myself, I found that
there were indeed points of contention.]

-Mike



Re: Zero-width ligator

2000-08-08 Thread Doug Ewell

Peter Constable [EMAIL PROTECTED] wrote:

 I inquired about that recently on the unicoRe list, and was told that
 the semantics of ZWJ/ZWNJ will be extended in 3.0.1 (or maybe it was
 3.1).

Well, that's a good thing.  It sounds like the benefits described by
Everson will be made available in Unicode after all.

 You mentioned that this decision was made at the meeting in February.
 Interestingly, I was at that meeting, and my recollection was that
 extending the semantics of ZWJ/ZWNJ was going to be given further
 consideration, after some people investigated the implications of
 extending the semantics of ZWJ, particularly for Indic scripts. But I
 left before the meeting was over, and the minutes reflect that a
 decision was in fact made (although the weasel word "provisionally"
 is used).

Thanks for the insight on this process.  Somehow I needed more
information than the word "rejected" in the Pipeline table could
offer.  \u263a

-Doug Ewell
 Fullerton, California



RE: Unicode String literals on various

2000-08-08 Thread Marco . Cimarosti

Antoine Leca wrote:
   char C_thai[] = 
 "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

Would the Unicode values be converted to the local SBCS/MBCS character set?

If yes:

Is the definition of this locale info part of the C99 standard itself, or is
it operating system's locale?

And what happens to Unicode values that cannot be converted in that
character set?

Thanks.
_ Marco



FW:Unicode Font with Special Effects

2000-08-08 Thread Magda Danish (Unicode)



-Original Message-
From: Greg Olsen [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 04, 2000 3:14 AM
To: [EMAIL PROTECTED]
Subject: Font question


Dear Sirs,
My name is Greg Olsen.  I am an Industrial Designer in Irvine, California,
and I need information.  I am developing a user interface that I will
hand off to be programmed in C.
The interface design uses the Arial font; the catch is that there is a
beveled effect on each letter.  I was wondering if there is a Unicode
font that is capable of this type of effect.
The interface is for a medical product that is to be released in
numerous countries and languages.  Any information would be helpful.
Thank you for your time and response,
Greg Olsen

Patton Design
8 Pasteur #170
Irvine, CA  92618
[EMAIL PROTECTED]



Re: is there any way to change already defined character codes?

2000-08-08 Thread John H. Jenkins

At 11:01 PM -0800 8/7/00, Jianping Yang wrote:
Not quite, in Unicode's case: some code points for Hangul were relocated
between Unicode 1.1 and 2.0 :)


And have regretted it ever since. Moving the Hangul and renaming æ
have caused no end of problems.  It was the fact that it was so
disastrous when done once that makes everyone determined not to do it
again.

--
=
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.blueneptune.com/~tseng



Re: Unicode String literals on various

2000-08-08 Thread Antoine Leca

[EMAIL PROTECTED] wrote:
 
 Antoine Leca wrote:
char C_thai[] =
  "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
 
 Would the Unicode values be converted to the local SBCS/MBCS character set?

In this case, yes (assuming a normal C compiler).

With wchar_t / L"...", they are converted to the local "wide character set",
which happens to be Unicode on most boxes, with the following main exceptions:

- some (cheap) C compilers do not have any special support for wchar_t,
 so it defaults to the same as char, and is usually 8 bits;

- with East Asian C compilers, wchar_t holds either Unicode or
 a flat character coding, in which every character, whether coded as SBCS or
 DBCS, is stored with its nominal legacy code in a 16-bit or 32-bit cell
 (this differs from MBCS in that the ASCII characters are stored
 in cells the same width as DBCS characters);

- EBCDIC implementations have their own rules (for obvious reasons), which
 I do not know exactly (I am not sure they are consistent).
C99 also specifies that if __STDC_ISO_10646__ is defined, then the wchar_t
values are the Unicode codepoints (then to learn if it is UTF-16 or UTF-32,
one should look at WCHAR_MAX to learn if wchar_t are 16-bit or 32-bit).
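
The check described above can be done at compile time. This is only a sketch:
the function name and its return convention are hypothetical, and the
16-versus-32 split follows the reasoning in this message rather than any
wording of the Standard.

```c
#include <wchar.h>  /* wchar_t and WCHAR_MAX */

/* Hypothetical helper. Returns 0 if the implementation makes no promise
   that wchar_t values are ISO 10646 code points, 32 if it does and wchar_t
   can hold values above 0xFFFF, and 16 otherwise. */
int wchar_unicode_bits(void)
{
#ifndef __STDC_ISO_10646__
    return 0;
#elif WCHAR_MAX > 0xFFFF
    return 32;
#else
    return 16;
#endif
}
```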


 
 If yes:
 
 Is the definition of this locale info part of the C99 standard itself, or is
 it operating system's locale?

It is "implementation-defined". Which means:
- it is not required in any way by the C99 Standard itself (except if
 __STDC_ISO_10646__ is defined);
- it is required to be stated in full words in the documentation for the compiler;
- it can vary with compilation options; often the OS's current locale is
 the default value, which can be overridden.

 
 And what happens to Unicode values that cannot be converted in that
 character set?

The compiler is required to fall back to something (it cannot refuse to
compile, nor can it simply drop the character); it is allowed to "fall back"
to a different character depending on the typed character, though; so for example,

  #include <stdio.h>
  int main(void) {  printf("%ls\n", L"\u00C0 table!");  return 0;  }

Can produce (among others, this is UTF-8 encoded):

À table!
A table!
à table!
 table!



I could go on about this subject (all of this should eventually be
cooked into a FAQ anyway), but I do not want to flood the list with a
marginally interesting subject.


Antoine



Re: Summary: xml:lang validity and RFC 1766 refs to outdated codes

2000-08-08 Thread John Cowan

Mike Brown wrote:

 Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO 639 and
 ISO 3166, at least not by a strict interpretation of its formal language.
 And to date, there still is no successor to RFC 1766.

Right.  So

<span xml:lang="roa">Yn nediwn seint yn llinghedig,
yn nediwn seint yn cor</span>

is not proper XML, although it is well-formed, because the language tag
"roa" (Romance, Other) is not legal by RFC 1766.  But when RFC 1766 is
officially revised to include such language tags, it *will* be good XML.
 
 The use of "are" in that statement sounds as definitive as "must" to me.

No, because a violation of a "must" rule is a violation of well-formedness,
requiring the report of a fatal error and draconian error recovery.

 As
 an XML document author, or the programmer of an XML document authoring tool,
 tell me, do I or do I not use RFC 1766 language tags/identifiers as xml:lang
 values?

You do.

 It seems that XML says I must use them, but it would not be a violation
 of validity if I didn't use them.

It is a violation of the intent of the xml:lang attribute not to use them.

 ...so the removal of productions 33-38 from XML really just seems to be
 intended to allow RFC 1766 and its successors to determine the proper
 construction of a language tag, which makes more sense than trying to
 reiterate the RFC's technical contents in XML's specification.

Just so.

 It doesn't
 necessarily follow that xml:lang values can avoid conforming to RFC 1766.

They cannot avoid it.
 
-- 

Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED]
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,   || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)



RE: Unicode String literals on various

2000-08-08 Thread Marco . Cimarosti

Hi, Antoine.

 I could go on about this subject (all of this should eventually be
 cooked into a FAQ anyway), but I do not want to flood the list with a
 marginally interesting subject.

Merci beaucoup. It was very informative!

Ciao.
Marco

P.S. You should not be so shy: up to date information
about how Unicode may be used in the world's most
important programming language does not sound so
"off topic" or "marginally interesting" to me.

Ciao++
M.



GEORGIAN DIGITs

2000-08-08 Thread 11digitboy

Where are the Georgian digits? I want a set of Georgian
digits so I can use them as counter digits.

--
Robert Lozyniak
Accusplit pedometer manufactures can go suck eggs
My page: http://walk.to/11
[EMAIL PROTECTED] - email
(917) 421-3909 x1133 - voicemail/fax



___
Get your own FREE Bolt Onebox - FREE voicemail, email, and
fax, all in one place - sign up at http://www.bolt.com




Re: GEORGIAN DIGITs

2000-08-08 Thread Michael (michka) Kaplan

Well, if the language does not have them, you will not find them. Funny how
that works, huh?

michka

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

- Original Message -
From: [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Tuesday, August 08, 2000 4:24 PM
Subject: GEORGIAN DIGITs


 Where are the Georgian digits? I want a set of Georgian
 digits so I can use them as counter digits.







Re: is there any way to change already defined character codes?

2000-08-08 Thread Michael (michka) Kaplan

From: [EMAIL PROTECTED]
  [EMAIL PROTECTED] wrote:
  E.g., if you look at the Latin part, you see that the 26 letters used in
  modern English are all contiguously ordered in two areas: U+0041 to U+005A
  (uppercase) and U+0061 to U+007A (lowercase).

 Yeah, but so what? All you gotta do is turn the 6th bit off and there you go!

  But that's the end of the story! All the other hundreds of Latin letters
  are scattered all over, using no consistent order.

 Too bad unicode values can't be fractions!!

Let's take this one offline, Robert.

michka








Why not to move characters (was: is there any way to change already defined character codes?)

2000-08-08 Thread 11digitboy

You don't want to move characters because then you
could change the meaning of a sentence that way.
I don't want to price something at 1000 cows when
I mean 1000 yen. Or worse, 100 yen.
