Re: [totally OT] Unicode terminology (was Re: string vs. char [was Re: Java and Unicode])

2000-11-22 Thread 11digitboy

If the difference between "A" and "a" is called "case",
what is the difference between HIRAGANA LETTER YA
and KATAKANA LETTER YA called? (I think either of
those letters would do to describe this with the
new code pages. The description would be enhanced
by liberal application of HIRAGANA-KATAKANA LONG
VOWEL MARK.)

I like "Astral Planes" better.
Will they include INUKTITUT VIGESIMAL DIGITs?

I should have voted for Sarasvati for US President. Instead
I voted for Saotome Nodoka.



Marco Cimarosti <[EMAIL PROTECTED]> wrote:
> David Starner wrote:
> > Sent: 20 Nov 2000, Mon 16.18
> > To: Unicode List
> > Subject: Re: string vs. char [was Re: Java and Unicode]
> >
> > On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka)
> > Kaplan wrote:
> > > From: "Marco Cimarosti" <[EMAIL PROTECTED]>
> > >
> > > > the Surrogate (aka "Astral") Planes.
> > >
> > > I believe the UTC has deprecated the term Astral planes with
> > > extreme prejudice. HTH!
> >
> > The UTC has chosen not to use the term Astral Plane. Keeping
> > that in mind, I can choose to use whatever terms I want, realizing
> > of course that some may not get my point across. The UTC chose
> > Surrogate Planes for perceived functionality and translatability;
> > I chose Astral Planes for perceived grace and beauty.
> 
> Well, I am not as angrily pro "Astral Planes" as
> David is, but I too find
> the humorous term prettier than the official one.
> And I used it because I
> think that a few people on this list may still
> find it clearer than the
> official "Surrogate Planes" -- which is more serious
> and descriptive, but
> still relatively new to many.
> 
> Moreover, although my attitude towards the UTC

I thought UTC meant Universal Coordinated Time,
like this: UTC 2000a11l22d13h02m.

> (the "government" of Unicode) is much more friendly than my attitude
> towards real governments out there (if people like J. Jenkins or
> M. Davis were the President of the USA this would be a much nicer
> world!), I still don't feel quite like obeying any government's orders,
> prohibitions or deprecations without offering due resistance.
> 
> 8-) (<-- smiley wearing anti-tear-gas glasses)
> 
> _ Marco
> 
> __
> La mia e-mail è ora: My e-mail is now:
> >>>   marco.cimarostiªeurope.com   <<<
> (Cambiare "ª" in "@")  (Change "ª" to "@")
>  
> 
> 





Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Keld Jørn Simonsen

On Mon, Nov 20, 2000 at 09:36:08AM -0800, Mark Davis wrote:
> The UTC will be using the terms "supplementary code points", "supplementary
> characters" and "supplementary planes". The term it is "deprecating with
> extreme prejudice" is "surrogate characters".
> 
> See http://www.unicode.org/glossary/ for more information.

That's good, as it is more consistent with IS 10646.
10646 does not use the term "surrogate".

Keld



Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Antoine Leca

Please keep in mind my sidepoint:

> Antoine Leca wrote:
>
> > Please note that I left aside UTF-16, because I am not clear
> > if 16-bit are adequate or not to code UTF-16 in wchar_t (in other words, if
> > wchar_t can be a multiwide encoding).

Marco Cimarosti wrote [with minor editing to keep it to the point]:
>
> > > wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
> > > wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);
> > > size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);
> >
> > What is the point?
>  
> But if «c >= 0x10000», then the character would be represented in «s» (a
> UTF-16 string) by a surrogate pair, and the function would thus return the
> address of the *high surrogate*.
> 
> E.g., assuming that «s» is «{0x2190, 0xD800, 0xDC05, 0x2192, 0x0000}» and
> «c» is 0x10005, both functions would return «&s[1]»: the address of the high
> surrogate 0xD800.

As I said, I am unsure whether UTF-16 is legal for wchar_t. If it is, I will
agree with you. But the main point of the people who say "UTF-16 is illegal
for wchar_t" is just this one: there are some cases that are not handled
nicely by the current API.

 
Antoine



Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Mark Davis

The UTC will be using the terms "supplementary code points", "supplementary
characters" and "supplementary planes". The term it is "deprecating with
extreme prejudice" is "surrogate characters".

See http://www.unicode.org/glossary/ for more information.

Mark

- Original Message -
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Monday, November 20, 2000 06:54
Subject: Re: string vs. char [was Re: Java and Unicode]


> From: "Marco Cimarosti" <[EMAIL PROTECTED]>
>
> > the Surrogate (aka "Astral") Planes.
>
> I believe the UTC has deprecated the term Astral planes with extreme
> prejudice. HTH!
>
> michka
>
> a new book on internationalization in VB at
> http://www.i18nWithVB.com/
>
>
>




Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread addison

Hi Jani,

I dunno. I oversimplified in that statement about exposing vs. hiding.

ICU "hides" the facts about the Unicode implementation in macros,
specifically a next and previous character macro and various other
fillips. If you look very closely at the function (method) prototypes you
can see that, in fact, a "character" is a 32-bit entity and a string is
made (conditionally) of 16-bit entities. But, as you suggest, ICU makes it
easy to work with (and is set up so that a sufficiently motivated coder
could change the internal encoding).


If you ask 100 programmers what an index into a string means, they'll give you
the wrong answer 99 times... because there is little or no I18n training in
the course of becoming a programmer. The members of this list are
continually ground down by the sheer inertia of ignorance (I just gave up
answering one about email... I must have written a response to that
message a bunch of times, but don't have the time or stamina this morning
to go find and rework one of them).


In any case this has been a fun and instructive interlude. As I said in
my initial email, I tend to be a CONSUMER of Unicode APIs rather than a
creator. I haven't written a Unicode support package in quite some time
(and the last one was a UTF-8 hack in C++). It's good to be familiar with
the details, but I find that, as a programmer one typically doesn't fully
comprehend the design decisions until one faces them oneself. As it is, I
ended up changing my design and sample code over the weekend to follow the
suggestions of several on this list who've Been There.

As a side note: one of the problems I faced on this project was the need
to keep the Unicode and locale libraries extremely small (this is an
embedded OS). I would happily have borrowed ICU to actually *be* the
library... but it's too large. I've had to design a tiny (and therefore
quite limited) support library. It's been an interesting experience.

Best Regards,

Addison

===
Addison P. Phillips          Principal Consultant
Inter-Locale LLC             http://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering & Consulting Services

On Mon, 20 Nov 2000, Jani Kajala wrote:

> 
> > > The question, I guess, boils down to: put it in the interface, or hide it
> > > in the internals. ICU exposes it. My spec, up to this point, hides it,
>
> (I'm aware that the original question was about C interfaces, so you might
> consider this a bit off topic, but I just wanted to comment on the exposed
> encoding.)
>
> I think that exposing the encoding in interfaces doesn't do any good. It
> violates object-oriented design principles and it is not even intuitive.
>
> I'd bet that if we take 100 programmers and ask them 'What is this index in
> the context of this string?', in every case we'll get the answer that it's
> of course the nth character position. Nobody who isn't well aware of
> character encodings will ever think of code units. Thus, it is not intuitive
> to use indices to point at code units, especially as Unicode has been so
> well marketed as a '16-bit character set'.
>
> Besides, you can always use (C++ style) iterators instead of plain indices
> without any loss in performance or in syntactic convenience. By an
> 'iterator' I here mean a simple encapsulated pointer which behaves just like
> any C++ Standard Template Library random access iterator but takes the
> encoding into account. Example:
>
> for ( String::Iterator i = s.begin() ; i != s.end() ; ++i )
> // ith character in s = *i
> // i+nth character in s = i[n]
>
> The solution works with any encoding as long as String::Iterator is defined
> properly.
>
> The conclusion that using indices won't make a difference in performance
> also makes sense if you consider the basic underlying task: if you need
> random access to a string, you need to check for characters spanning
> multiple code units. So the task has the same O(n) complexity; using indices
> won't help a bit. If the user needs access to an arbitrary character, he
> needs to iterate anyway. It is just a matter of how you want to encapsulate
> the task.
> 
> 
> Regards,
> Jani Kajala
> http://www.helsinki.fi/~kajala/
> 
> 
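To put the iterator idea above in plain C terms, the operation that such an
iterator's ++ has to encapsulate on a UTF-16 string looks roughly like this
(a sketch with made-up helper names, not any particular library's code):

#include <stdint.h>
#include <stddef.h>

typedef uint16_t uc16;   /* one UTF-16 code unit */

/* Step forward by one *character*: past one code unit, or past two if they
   form a surrogate pair.  This is what a code-point-aware ++ has to do. */
static const uc16 *uc16_next(const uc16 *p)
{
    if (p[0] >= 0xD800 && p[0] <= 0xDBFF &&
        p[1] >= 0xDC00 && p[1] <= 0xDFFF)
        return p + 2;            /* surrogate pair: one character, two units */
    return p + 1;                /* BMP character: one unit */
}

/* "The nth character of s" is therefore an O(n) walk, exactly as argued above. */
static const uc16 *uc16_at(const uc16 *s, size_t n)
{
    while (n-- > 0 && *s != 0)
        s = uc16_next(s);
    return s;
}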




[totally OT] Unicode terminology (was Re: string vs. char [was Re: Java and Unicode])

2000-11-20 Thread Marco Cimarosti

David Starner wrote:
> Sent: 20 Nov 2000, Mon 16.18
> To: Unicode List
> Subject: Re: string vs. char [was Re: Java and Unicode]
>
> On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka)
> Kaplan wrote:
> > From: "Marco Cimarosti" <[EMAIL PROTECTED]>
> >
> > > the Surrogate (aka "Astral") Planes.
> >
> > I believe the UTC has deprecated the term Astral planes with extreme
> > prejudice. HTH!
>
> The UTC has chosen not to use the term Astral Plane. Keeping
> that in mind, I can choose to use whatever terms I want, realizing
> of course that some may not get my point across. The UTC chose
> Surrogate Planes for perceived functionality and translatability;
> I chose Astral Planes for perceived grace and beauty.

Well, I am not as angrily pro "Astral Planes" as David is, but I too find
the humorous term prettier than the official one. And I used it because I
think that a few people on this list may still find it clearer than the
official "Surrogate Planes" -- which is more serious and descriptive, but
still relatively new to many.

Moreover, although my attitude towards the UTC (the "government" of Unicode)
is much more friendly than my attitude towards real governments out there
(if people like J. Jenkins or M. Davis were the President of the USA this
would be a much nicer world!), I still don't feel quite like obeying any
government's orders, prohibitions or deprecations without offering due
resistance.

8-) (<-- smiley wearing anti-tear-gas glasses)

_ Marco

__
La mia e-mail è ora: My e-mail is now:
>>>   marco.cimarostiªeurope.com   <<<
(Cambiare "ª" in "@")  (Change "ª" to "@")
 




Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread John Cowan

David Starner wrote:
> I chose Astral Planes for perceived grace
> and beauty.

Thank you!

-- 
There is / one art   || John Cowan <[EMAIL PROTECTED]>
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Michael (michka) Kaplan

I think the issue is more one of the semantic baggage that terms like
astral, imaginary, or irrational bring to the table.

Refusing to potentially insult the people who place importance on the
characters that will be encoded in planes other than the BMP is a thing of
grace and beauty (much more so than the insult would be!).

I think the UTC action is a responsible one.

(Just my two cents)

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "David Starner" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Monday, November 20, 2000 7:18 AM
Subject: Re: string vs. char [was Re: Java and Unicode]


> On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka) Kaplan wrote:
> > From: "Marco Cimarosti" <[EMAIL PROTECTED]>
> >
> > > the Surrogate (aka "Astral") Planes.
> >
> > I believe the UTC has deprecated the term Astral planes with extreme
> > prejudice. HTH!
>
> The UTC has chosen not to use the term Astral Plane. Keeping that in mind,
> I can choose to use whatever terms I want, realizing of course that some
> may not get my point across. The UTC chose Surrogate Planes for perceived
> functionality and translatability; I chose Astral Planes for perceived
> grace and beauty.
>
> --
> David Starner - [EMAIL PROTECTED]
> http://dvdeug.dhis.org
> Looking for a Debian developer in the Stillwater, Oklahoma area
> to sign my GPG key
>




Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread David Starner

On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka) Kaplan wrote:
> From: "Marco Cimarosti" <[EMAIL PROTECTED]>
> 
> > the Surrogate (aka "Astral") Planes.
> 
> I believe the UTC has deprecated the term Astral planes with extreme
> prejudice. HTH!

The UTC has chosen not to use the term Astral Plane. Keeping that in mind,
I can choose to use whatever terms I want, realizing of course that some
may not get my point across. The UTC chose Surrogate Planes for perceived
functionality and translatability; I chose Astral Planes for perceived grace
and beauty. 

-- 
David Starner - [EMAIL PROTECTED]
http://dvdeug.dhis.org
Looking for a Debian developer in the Stillwater, Oklahoma area 
to sign my GPG key



Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Michael (michka) Kaplan

From: "Marco Cimarosti" <[EMAIL PROTECTED]>

> the Surrogate (aka "Astral") Planes.

I believe the UTC has deprecated the term Astral planes with extreme
prejudice. HTH!

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/





Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Marco Cimarosti

Antoine Leca wrote:
> Marco Cimarosti wrote:
> > Actually, C does have different types for characters within
> strings and for
> > characters in isolation.
>
> That is not my point of view.
> There is a special case for 'H', that holds int type rather
> than char, for
> backward compatibility reasons (such as because the first
> versions of C were
> not able to deal correctly with to-be-promoted arguments).
> Similarly, a
> number of (old) functions use int for the character arguments.
> Then, there is the point of view that int represents _either_ a valid
> character, _or_ an error indication (EOF). This is the reason
> that makes
> int used for the return type of fgetc.

OK.

> Outside this, a string is clearly an array of characters, and
> characters are
> stored using the type char (or one of the sign alternatives).
> As a result,
> you can write 'H' either as such, or as "Hello, world!\n"[0].

OK.

> > The type of a string literal (e.g. "Hello world!\n") is
> "array of char",
> > while the type of a character literal (e.g. 'H') is "int".
> >
> > This distinction is generally reflected also in the C
> library, so that you
> > don't get compiler warnings when passing character
> constants to functions.
>
> You need not, since C considers character to be (small)
> integers, which eases
> passing of arguments. This is unrelated to the issue.

OK. I was just describing the background.

> > This distinction has been retained also in the newer "wide character
> > library": "wchar_t" is the wide equivalent of "char", while
> "wint_t" is the
> > wide equivalent of "int".
>
> Not exactly. The wide versions has the same distinction as
> the narrow one for
> the second case above (finding errors), but not for the first
> one (promoting).

OK.

> > The wide version of the examples above is:
> >
> > int fputws(const wchar_t * c, FILE * stream);
> > wint_t fputwc(wint_t c, FILE * stream);
> ^^
> Instead, we have
>   int fputws(const wchar_t * s, FILE * stream);
>   wint_t fputwc(wchar_t c, FILE *stream);
>
> It shows clearly that c cannot hold the WEOF value. OTOH, the
> returned value
> _can_ be the error indication WEOF, so the type is wint_t.

Oops! Sorry.

I had two versions of "wchar.h" at hand: one (lovingly crafted by myself)
had «wchar_t c»; the other one (shipped with Microsoft Visual C++ 6.0) had:

_CRTIMP wint_t __cdecl fputwc(wint_t, FILE *);

I made the mistake of trusting the second one. :-)

> Similarly, the type of L'H' is wchar_t. You gave other
> examples in your "But".
>
> > int iswalpha(wint_t c);
>
> Here, the iswalpha is intended to be able to test valid
> characters as well
> as the error indication, so the type is wint_t; here WEOF is
> specifically allowed.

OK.

> > In an Unicode implementation of the "wide character
> library" (wchar.h and
> > wctype.h), this difference may be exploited to use
> different UTF's for
> > strings and characters:
>
> Ah, now we go into the interesting field.
> Please note that I left aside UTF-16, because I am not clear
> if 16-bit are
> adequate or not to code UTF-16 in wchar_t (in other words, if
> wchar_t can be
> a multiwide encoding).
>
> > typedef unsigned short wchar_t;
> > /* UTF-16 character, used within string. */
> >
> > typedef unsigned long  wint_t;
> > /* UTF-32 character, used for handling isolated characters. */
>
> To date, no problem.
>
> > But, unluckily, there is a "but". Type "wchar_t" is *also*
> used for isolated
> > character in a couple of stupid APIs:
>
> See above for another example: fputwc...
>
> > But I think that changing those "wchar_t c" to "wint_t c"
> is a smaller
> > "violence" to the standards than changing them to "const
> wchar_t * c".
>
> ;-)

OK, my trick is dirty as well, just a bit easier to hide. ;-)

> > And you can also implement it in an elegant, quasi-standard way:
> 
> > wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
> > wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);
> > size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);
>
> What is the point? You cannot pass to these anything other than values
> between 0 (WCHAR_MIN) and WCHAR_MAX anyway. And there are no really
> "interesting" ways to extend the meaning of these functions outside
> this range.
> Or do I miss something?

_wcschr_32 and _wcsrchr_32 would return a pointer to the first (or last)
occurrence of the specified character in the string, just like their
standard counterparts.

But if «c >= 0x10000», then the character would be represented in «s» (a
UTF-16 string) by a surrogate pair, and the function would thus return the
address of the *high surrogate*.

E.g., assuming that «s» is «{0x2190, 0xD800, 0xDC05, 0x2192, 0x0000}» and
«c» is 0x10005, both functions would return «&s[1]»: the address of the high
surrogate 0xD800.

Similarly for _wcrtomb_32(): assuming that «s» points into a UTF-8 string,
the function would insert in «s» the UTF-8 sequence corresponding to «c»
(four octets for a supplementary character such as this one).
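
A rough sketch of how _wcschr_32 could do this, under this thread's
assumptions (16-bit wchar_t holding UTF-16, 32-bit wint_t); illustrative
code only, not a proposal for the standard library:

#include <stddef.h>
#include <wchar.h>    /* wchar_t, wint_t */

wchar_t * _wcschr_32(const wchar_t * s, wint_t c)
{
    if (c < 0x10000) {
        /* BMP character: search for a single code unit (the terminator
           itself can be found, as with the standard wcschr). */
        for (; *s != 0; ++s)
            if ((wint_t)*s == c)
                return (wchar_t *)s;
        return (c == 0) ? (wchar_t *)s : NULL;
    } else {
        /* Supplementary character: search for its surrogate pair and
           return the address of the *high* surrogate. */
        wchar_t hi = (wchar_t)(0xD800 + ((c - 0x10000) >> 10));
        wchar_t lo = (wchar_t)(0xDC00 + ((c - 0x10000) & 0x3FF));
        for (; *s != 0; ++s)
            if (s[0] == hi && s[1] == lo)
                return (wchar_t *)s;
        return NULL;
    }
}

With the sample string above, _wcschr_32(s, 0x10005) matches the pair
0xD800 0xDC05 and returns «&s[1]».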

> > #ifdef PEDANTIC_STANDARD
> > wchar_t * w

Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Antoine Leca

Marco Cimarosti wrote:
> 
> Actually, C does have different types for characters within strings and for
> characters in isolation.

That is not my point of view.
There is a special case for 'H', that holds int type rather than char, for
backward compatibility reasons (such as because the first versions of C were
not able to deal correctly with to-be-promoted arguments). Similarly, a
number of (old) functions use int for the character arguments.
Then, there is the point of view that int represents _either_ a valid
character, _or_ an error indication (EOF). This is the reason that makes
int used for the return type of fgetc.

Outside this, a string is clearly an array of characters, and characters are
stored using the type char (or one of the sign alternatives). As a result,
you can write 'H' either as such, or as "Hello, world!\n"[0].
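
A small illustration of that point, using nothing beyond standard C:

#include <stdio.h>

int main(void)
{
    /* A character constant has type int; an element of a string literal
       is a plain char. */
    printf("%lu %lu\n", (unsigned long)sizeof 'H',
                        (unsigned long)sizeof "Hello, world!\n"[0]);
    /* Typically prints "4 1"; the two expressions still compare equal: */
    printf("%d\n", 'H' == "Hello, world!\n"[0]);
    return 0;
}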


> The type of a string literal (e.g. "Hello world!\n") is "array of char",
> while the type of a character literal (e.g. 'H') is "int".
>
> This distinction is generally reflected also in the C library, so that you
> don't get compiler warnings when passing character constants to functions.

You need not, since C considers characters to be (small) integers, which
eases the passing of arguments. This is unrelated to the issue.

 
> This distinction has been retained also in the newer "wide character
> library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
> wide equivalent of "int".

Not exactly. The wide versions have the same distinction as the narrow ones
for the second case above (finding errors), but not for the first one
(promoting).

 
> The wide version of the examples above is:
> 
> int fputws(const wchar_t * c, FILE * stream);
> wint_t fputwc(wint_t c, FILE * stream);
^^
Instead, we have
  int fputws(const wchar_t * s, FILE * stream);
  wint_t fputwc(wchar_t c, FILE *stream);

It shows clearly that c cannot hold the WEOF value. OTOH, the returned value
_can_ be the error indication WEOF, so the type is wint_t.

Similarly, the type of L'H' is wchar_t. You gave other examples in your "But".

 
> int iswalpha(wint_t c);

Here, iswalpha is intended to be able to test valid characters as well
as the error indication, so the type is wint_t; here WEOF is specifically
allowed.
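
This is the same pattern as the narrow getc/EOF convention. A minimal sketch
of why the wider type matters, using only standard <wchar.h>/<wctype.h>
calls:

#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    wint_t c;                 /* must be able to hold any wide char *or* WEOF */
    unsigned long alpha = 0;

    while ((c = fgetwc(stdin)) != WEOF)   /* WEOF is out-of-band, like EOF */
        if (iswalpha(c))
            ++alpha;

    printf("%lu alphabetic characters\n", alpha);
    return 0;
}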


> In an Unicode implementation of the "wide character library" (wchar.h and
> wctype.h), this difference may be exploited to use different UTF's for
> strings and characters:

Ah, now we go into the interesting field.
Please note that I left aside UTF-16, because I am not clear whether 16 bits
are adequate to code UTF-16 in wchar_t (in other words, whether wchar_t can
be a multiwide encoding).

 
> typedef unsigned short wchar_t;
> /* UTF-16 character, used within string. */
> 
> typedef unsigned long  wint_t;
> /* UTF-32 character, used for handling isolated characters. */

To date, no problem.

 
> But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated
> character in a couple of stupid APIs:

See above for another example: fputwc...

 
> But I think that changing those "wchar_t c" to "wint_t c" is a smaller
> "violence" to the standards than changing them to "const wchar_t * c".

;-)

> And you can also implement it in an elegant, quasi-standard way:
 
> wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
> wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);
> size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);

What is the point? You cannot pass to these anything other than values
between 0 (WCHAR_MIN) and WCHAR_MAX anyway. And there are no really
"interesting" ways to extend the meaning of these functions outside
this range.
Or do I miss something?

 
> #ifdef PEDANTIC_STANDARD
> wchar_t * wcschr(const wchar_t * s, wchar_t c);
> wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
> size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
> #else
> #define wcschr  _wcschr_32
> #define wcsrchr _wcsrchr_32
> #define wcrtomb _wcrtomb_32
> #endif



RE: string vs. char [was Re: Java and Unicode]

2000-11-17 Thread addison

Well... I think you're right. I knew that char and string units weren't
really the same thing. My concern was how to make it easy on developers to
use the Unicode API using their "native intelligence".

More thought makes me less certain of my approach. Specifically, as Mark
points out, looping structures are much more ugly when using
pointers. And I still have to have all of the code for the scalar value
conversion. It's better to force a casting macro on the developers than
trying to do their dirty work for them. Fooling developers who use your
API is a Bad Idea, usually.

Thanks again for the feedback.

Addison

===
Addison P. Phillips          Principal Consultant
Inter-Locale LLC             http://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering & Consulting Services

On Fri, 17 Nov 2000, Marco Cimarosti wrote:

> Addison P. Phillips wrote:
> > I ended up deciding that the Unicode API for this OS will only work in
> > strings. CTYPE replacement functions (such as isalpha) and
> > character based
> > replacement functions (such as strchr) will take and return
> > strings for
> > all of their arguments.
> >
> > Internally, my functions are converting the pointed character to its
> > scalar value (to look it up in the database most efficiently).
> >
> > This isn't very satisfying. It goes somewhat against the grain of 'C'
> > programming. But it's equally unsatisfying to use a 32-bit
> > representation
> > for a character and a 16-bit representation for a string,
> > because in 'C',
> > a string *is* an array of characters. Which is more
> > natural? Which is more common? Iterating across an array of
> > 16-bit values
> > or
> 
> Actually, C does have different types for characters within strings and for
> characters in isolation.
> 
> The type of a string literal (e.g. "Hello world!\n") is "array of char",
> while the type of a character literal (e.g. 'H') is "int".
> 
> This distinction is generally reflected also in the C library, so that you
> don't get compiler warnings when passing character constants to functions.
> 
> E.g., compare the following functions from <stdio.h>:
> 
> int fputs(const char * s, FILE * stream);
> int fputc(int c, FILE * stream);
> 
> The same convention is generally used through the C library, not only in the
> I/O functions. E.g.:
> 
> int isalpha(int c);
> int tolower(int c);
> 
> This distinction has been retained also in the newer "wide character
> library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
> wide equivalent of "int".
> 
> The wide version of the examples above is:
> 
> int fputws(const wchar_t * c, FILE * stream);
> wint_t fputwc(wint_t c, FILE * stream);
> 
> int iswalpha(wint_t c);
> wint_t towlower(wint_t c);
> 
> In an Unicode implementation of the "wide character library" (wchar.h and
> wctype.h), this difference may be exploited to use different UTF's for
> strings and characters:
> 
> typedef unsigned short wchar_t;
> /* UTF-16 character, used within string. */
> 
> typedef unsigned long  wint_t;
> /* UTF-32 character, used for handling isolated characters. */
> 
> But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated
> character in a couple of stupid APIs:
> 
> wchar_t * wcschr(const wchar_t * s, wchar_t c);
> wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
> size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
> 
> BTW, the blunder in wcschr() and wcsrchr() is inherited from their "narrow"
> ancestors: strchr() and strrchr().
> 
> But I think that changing those "wchar_t c" to "wint_t c" is a smaller
> "violence" to the standards than changing them to "const wchar_t * c".
> And you can also implement it in an elegant, quasi-standard way:
> 
> wchar_t * _wcschr_32(const wint_t * s, wchar_t c);
> wchar_t * _wcsrchr_32(const wint_t * s, wchar_t c);
> size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);
> 
> #ifdef PEDANTIC_STANDARD
> wchar_t * wcschr(const wchar_t * s, wchar_t c);
> wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
> size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
> #else
> #define wcschr  _wcschr_32
> #define wcsrchr _wcsrchr_32
> #define wcrtomb _wcrtomb_32
> #endif
> 
> I would like to see the opinion of C standardization experts (e.g. A. Leca)
> about this forcing of the C standard.
> 
> _ Marco.
> 
> __
> La mia e-mail è ora: My e-mail is now:
> >>>   marco.cimarostiªeurope.com   <<<
> (Cambiare "ª" in "@")  (Change "ª" to "@")
>  
> 
> 




Re: string vs. char [was Re: Java and Unicode]

2000-11-17 Thread addison

Thanks Mark. I've looked extensively at the ICU code in doing much of the
design on this system. What my email didn't end up saying was, basically,
that the "char" functions end up decoding the scalar value internally into a
32-bit integer.

The question, I guess, boils down to: put it in the interface, or hide it
in the internals. ICU exposes it. My spec, up to this point, hides it,
because I think that programmers will be working with strings more often
than with individual characters and that perhaps this will seem more
"natural".

Addison

===
Addison P. Phillips          Principal Consultant
Inter-Locale LLC             http://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering & Consulting Services

On Thu, 16 Nov 2000, Mark Davis wrote:

> We have found that it works pretty well to have a uchar32 datatype, with
> uchar16 storage in strings. In ICU (C version) we use macros for efficient
> access; in ICU (C++) version we use method calls, and for ICU (Java version)
> we have a set of utility static methods (since we can't add to the Java
> String API).
> 
> With these functions, the number of changes that you have to make to
> existing code is fairly small, and you don't have to change the way that
> loops are set up, for example.
> 
> Mark
> 
> - Original Message -
> From: <[EMAIL PROTECTED]>
> To: "Unicode List" <[EMAIL PROTECTED]>
> Sent: Thursday, November 16, 2000 13:24
> Subject: string vs. char [was Re: Java and Unicode]
> 
> 
> > Normally this thread would be of only academic interest to me...
> >
> > ...but this week I'm writing a spec for adding Unicode support to an
> > embedded operating system written in C. Due to Mssrs. O'Conner and
> > Scherer's presentations at the most recent IUC, I was aware of the clash
> > between internal string representations and the Unicode Scalar Value
> > necessary for efficient lookup.
> >
> > Now I'm getting alarmed about the solution I've selected.
> >
> > The OS I'm working on is written in C. I considered, therefore, using
> > UTF-8 as the internal Unicode representation (because I don't have the
> > option of #defining Unicode and using wchar), but the storage expansion
> > and the fact that several existing modules grok UTF-16 (well, UCS-2), led
> > me to go in the direction of UTF-16.
> >
> > I also considered supporting only UCS-2. It's a bad bad bad idea, but it
> > gets me out of the following:
> >
> > I ended up deciding that the Unicode API for this OS will only work in
> > strings. CTYPE replacement functions (such as isalpha) and character based
> > replacement functions (such as strchr) will take and return strings for
> > all of their arguments.
> >
> > Internally, my functions are converting the pointed character to its
> > scalar value (to look it up in the database most efficiently).
> >
> > This isn't very satisfying. It goes somewhat against the grain of 'C'
> > programming. But it's equally unsatisfying to use a 32-bit representation
> > for a character and a 16-bit representation for a string, because in 'C',
> > a string *is* an array of characters. Which is more
> > natural? Which is more common? Iterating across an array of 16-bit values
> > or
> >
> > ===
> > Addison P. Phillips          Principal Consultant
> > Inter-Locale LLC             http://www.inter-locale.com
> > Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]
> >
> > +1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
> > ===
> > Globalization Engineering & Consulting Services
> >
> >
> >
> 
> 




RE: string vs. char [was Re: Java and Unicode]

2000-11-17 Thread Marco Cimarosti

Ooops!

In my previous message, I wrote:

> wchar_t * _wcschr_32(const wint_t * s, wchar_t c);
> wchar_t * _wcsrchr_32(const wint_t * s, wchar_t c);

What I actually wanted to write is:

wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);

Sorry if this puzzled you.

_ Marco

__
La mia e-mail è ora: My e-mail is now:
>>>   marco.cimarostiªeurope.com   <<<
(Cambiare "ª" in "@")  (Change "ª" to "@")
 




RE: string vs. char [was Re: Java and Unicode]

2000-11-17 Thread Marco Cimarosti

Addison P. Phillips wrote:
> I ended up deciding that the Unicode API for this OS will only work in
> strings. CTYPE replacement functions (such as isalpha) and
> character based
> replacement functions (such as strchr) will take and return
> strings for
> all of their arguments.
>
> Internally, my functions are converting the pointed character to its
> scalar value (to look it up in the database most efficiently).
>
> This isn't very satisfying. It goes somewhat against the grain of 'C'
> programming. But it's equally unsatisfying to use a 32-bit
> representation
> for a character and a 16-bit representation for a string,
> because in 'C',
> a string *is* an array of characters. Which is more
> natural? Which is more common? Iterating across an array of
> 16-bit values
> or

Actually, C does have different types for characters within strings and for
characters in isolation.

The type of a string literal (e.g. "Hello world!\n") is "array of char",
while the type of a character literal (e.g. 'H') is "int".

This distinction is generally reflected also in the C library, so that you
don't get compiler warnings when passing character constants to functions.

E.g., compare the following functions from <stdio.h>:

int fputs(const char * s, FILE * stream);
int fputc(int c, FILE * stream);

The same convention is generally used through the C library, not only in the
I/O functions. E.g.:

int isalpha(int c);
int tolower(int c);

This distinction has been retained also in the newer "wide character
library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
wide equivalent of "int".

The wide version of the examples above is:

int fputws(const wchar_t * c, FILE * stream);
wint_t fputwc(wint_t c, FILE * stream);

int iswalpha(wint_t c);
wint_t towlower(wint_t c);

In a Unicode implementation of the "wide character library" (wchar.h and
wctype.h), this difference may be exploited to use different UTFs for
strings and characters:

typedef unsigned short wchar_t;
/* UTF-16 character, used within string. */

typedef unsigned long  wint_t;
/* UTF-32 character, used for handling isolated characters. */

But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated
characters in a couple of stupid APIs:

wchar_t * wcschr(const wchar_t * s, wchar_t c);
wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);

BTW, the blunder in wcschr() and wcsrchr() is inherited from their "narrow"
ancestors: strchr() and strrchr().

But I think that changing those "wchar_t c" to "wint_t c" is a smaller
"violence" to the standards than changing them to "const wchar_t * c".
And you can also implement it in an elegant, quasi-standard way:

wchar_t * _wcschr_32(const wint_t * s, wchar_t c);
wchar_t * _wcsrchr_32(const wint_t * s, wchar_t c);
size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);

#ifdef PEDANTIC_STANDARD
wchar_t * wcschr(const wchar_t * s, wchar_t c);
wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
#else
#define wcschr  _wcschr_32
#define wcsrchr _wcsrchr_32
#define wcrtomb _wcrtomb_32
#endif

I would like to see the opinion of C standardization experts (e.g. A. Leca)
about this forcing of the C standard.

_ Marco.

__
La mia e-mail è ora: My e-mail is now:
>>>   marco.cimarostiªeurope.com   <<<
(Cambiare "ª" in "@")  (Change "ª" to "@")
 




Re: string vs. char [was Re: Java and Unicode]

2000-11-16 Thread Mark Davis

We have found that it works pretty well to have a uchar32 datatype, with
uchar16 storage in strings. In ICU (C version) we use macros for efficient
access; in ICU (C++) version we use method calls, and for ICU (Java version)
we have a set of utility static methods (since we can't add to the Java
String API).

With these functions, the number of changes that you have to make to
existing code is fairly small, and you don't have to change the way that
loops are set up, for example.
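
For example, the storage side of that split is a small, self-contained
operation; a sketch (not actual ICU source) of writing one 32-bit code point
back into 16-bit storage:

#include <stddef.h>

typedef unsigned short uchar16;   /* 16-bit storage unit inside strings */
typedef unsigned long  uchar32;   /* full code point, handled in isolation */

/* Append one code point to a 16-bit buffer: one unit for the BMP, a
   surrogate pair otherwise.  Returns the number of units written. */
static size_t append_code_point(uchar16 * dst, uchar32 c)
{
    if (c < 0x10000) {
        dst[0] = (uchar16)c;
        return 1;
    }
    c -= 0x10000;
    dst[0] = (uchar16)(0xD800 + (c >> 10));     /* high surrogate */
    dst[1] = (uchar16)(0xDC00 + (c & 0x3FF));   /* low surrogate  */
    return 2;
}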

Mark

- Original Message -
From: <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, November 16, 2000 13:24
Subject: string vs. char [was Re: Java and Unicode]


> Normally this thread would be of only academic interest to me...
>
> ...but this week I'm writing a spec for adding Unicode support to an
> embedded operating system written in C. Due to Mssrs. O'Conner and
> Scherer's presentations at the most recent IUC, I was aware of the clash
> between internal string representations and the Unicode Scalar Value
> necessary for efficient lookup.
>
> Now I'm getting alarmed about the solution I've selected.
>
> The OS I'm working on is written in C. I considered, therefore, using
> UTF-8 as the internal Unicode representation (because I don't have the
> option of #defining Unicode and using wchar), but the storage expansion
> and the fact that several existing modules grok UTF-16 (well, UCS-2), led
> me to go in the direction of UTF-16.
>
> I also considered supporting only UCS-2. It's a bad bad bad idea, but it
> gets me out of the following:
>
> I ended up deciding that the Unicode API for this OS will only work in
> strings. CTYPE replacement functions (such as isalpha) and character based
> replacement functions (such as strchr) will take and return strings for
> all of their arguments.
>
> Internally, my functions are converting the pointed character to its
> scalar value (to look it up in the database most efficiently).
>
> This isn't very satisfying. It goes somewhat against the grain of 'C'
> programming. But it's equally unsatisfying to use a 32-bit representation
> for a character and a 16-bit representation for a string, because in 'C',
> a string *is* an array of characters. Which is more
> natural? Which is more common? Iterating across an array of 16-bit values
> or
>
> ===
> Addison P. Phillips          Principal Consultant
> Inter-Locale LLC             http://www.inter-locale.com
> Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]
>
> +1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
> ===
> Globalization Engineering & Consulting Services
>
>
>




string vs. char [was Re: Java and Unicode]

2000-11-16 Thread addison

Normally this thread would be of only academic interest to me...

...but this week I'm writing a spec for adding Unicode support to an
embedded operating system written in C. Due to Messrs. O'Conner and
Scherer's presentations at the most recent IUC, I was aware of the clash
between internal string representations and the Unicode Scalar Value
necessary for efficient lookup.

Now I'm getting alarmed about the solution I've selected.

The OS I'm working on is written in C. I considered, therefore, using
UTF-8 as the internal Unicode representation (because I don't have the
option of #defining Unicode and using wchar), but the storage expansion
and the fact that several existing modules grok UTF-16 (well, UCS-2), led
me to go in the direction of UTF-16.

I also considered supporting only UCS-2. It's a bad bad bad idea, but it
gets me out of the following:

I ended up deciding that the Unicode API for this OS will only work in
strings. CTYPE replacement functions (such as isalpha) and character based
replacement functions (such as strchr) will take and return strings for
all of their arguments.

Internally, my functions are converting the pointed character to its
scalar value (to look it up in the database most efficiently).
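
Something along these lines, in other words (a sketch with made-up names; the
property lookup is delegated to iswalpha() only to keep the example
self-contained, where a real implementation would consult its own database):

#include <wchar.h>
#include <wctype.h>

typedef unsigned short u16;       /* UTF-16 code unit */

/* Decode the code point that starts at p (combining a surrogate pair if
   present) into its scalar value. */
static unsigned long u16_scalar(const u16 * p)
{
    unsigned long c = p[0];
    if (c >= 0xD800 && c <= 0xDBFF && p[1] >= 0xDC00 && p[1] <= 0xDFFF)
        c = 0x10000 + ((c - 0xD800) << 10) + (p[1] - 0xDC00);
    return c;
}

/* String-based isalpha replacement: the "character" argument is a pointer
   into the string, not a bare 16-bit value. */
static int u_isalpha_at(const u16 * p)
{
    unsigned long c = u16_scalar(p);
    if (c <= 0xFFFF)
        return iswalpha((wint_t)c);   /* stand-in for the real lookup */
    return 0;   /* supplementary character: needs the real property database */
}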

This isn't very satisfying. It goes somewhat against the grain of 'C'
programming. But it's equally unsatisfying to use a 32-bit representation
for a character and a 16-bit representation for a string, because in 'C',
a string *is* an array of characters. Which is more
natural? Which is more common? Iterating across an array of 16-bit values
or 

===
Addison P. Phillips          Principal Consultant
Inter-Locale LLC             http://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering & Consulting Services