Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Antoine Leca

Marco Cimarosti wrote:
 
 Actually, C does have different types for characters within strings and for
 characters in isolation.

That is not my point of view.
There is a special case for 'H', which has type int rather than char, for
backward compatibility reasons (for example, because the first versions of C
could not deal correctly with to-be-promoted arguments). Similarly, a
number of (old) functions use int for their character arguments.
Then, there is the point of view that int represents _either_ a valid
character _or_ an error indication (EOF). This is why int is used as the
return type of fgetc.

Outside this, a string is clearly an array of characters, and characters are
stored using the type char (or one of the sign alternatives). As a result,
you can write 'H' either as such, or as "Hello, world!\n"[0].
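
For what it's worth, a tiny test program (assuming a C compiler, not C++,
where the rules for character constants differ) makes both points visible:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* A character constant has type int in C, not char. */
    assert(sizeof 'H' == sizeof(int));

    /* A character inside a string literal is just a char element of the array. */
    assert('H' == "Hello, world!\n"[0]);
    printf("%c\n", "Hello, world!\n"[0]);   /* prints: H */

    return 0;
}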


 The type of a string literal (e.g. "Hello world!\n") is "array of char",
 while the type of a character literal (e.g. 'H') is "int".

 This distinction is generally reflected also in the C library, so that you
 don't get compiler warnings when passing character constants to functions.

You need not, since C considers characters to be (small) integers, which eases
the passing of arguments. This is unrelated to the issue.

 
 This distinction has been retained also in the newer "wide character
 library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
 wide equivalent of "int".

Not exactly. The wide versions have the same distinction as the narrow ones for
the second case above (finding errors), but not for the first one (promoting).

 
 The wide version of the examples above is:
 
 int fputws(const wchar_t * c, FILE * stream);
 wint_t fputwc(wint_t c, FILE * stream);
^^
Instead, we have
  int fputws(const wchar_t * s, FILE * stream);
  wint_t fputwc(wchar_t c, FILE *stream);

It shows clearly that c cannot hold the WEOF value. OTOH, the returned value
_can_ be the error indication WEOF, so the type is wint_t.

Similarly, the type of L'H' is wchar_t. You gave other examples in your "But".

 
 int iswalpha(wint_t c);

Here, the iswalpha is intended to be able to test valid characters as well
as the error indication, so the type is wint_t; here WEOF is specifically
allowed.
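
To make the distinction concrete, a minimal sketch using only the standard
wchar.h/wctype.h interfaces:

#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Count the alphabetic characters in a wide-oriented stream.
   fgetwc() must return wint_t, because its result is either a valid
   wchar_t value or the out-of-band indicator WEOF; iswalpha() takes
   wint_t because WEOF is explicitly allowed as an argument. */
size_t count_walpha(FILE *stream)
{
    wint_t wc;
    size_t n = 0;

    while ((wc = fgetwc(stream)) != WEOF)
        if (iswalpha(wc))
            n++;
    return n;
}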


 In an Unicode implementation of the "wide character library" (wchar.h and
 wctype.h), this difference may be exploited to use different UTF's for
 strings and characters:

Ah, now we go into the interesting field.
Please note that I left aside UTF-16, because I am not clear whether 16 bits are
adequate to encode UTF-16 in wchar_t (in other words, whether wchar_t can be
a multiwide encoding).

 
 typedef unsigned short wchar_t;
 /* UTF-16 character, used within string. */
 
 typedef unsigned long  wint_t;
 /* UTF-32 character, used for handling isolated characters. */

To date, no problem.

 
 But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated
 character in a couple of stupid APIs:

See above for another example: fputwc...

 
 But I think that changing those "wchar_t c" to "wint_t c" is a smaller
 "violence" to the standards than changing them to "const wchar_t * c".

;-)

 And you can also implement it in an elegant, quasi-standard way:
corrected 
 wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
 wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);
 size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);

What is the point? You cannot pass to these anything other than values
between 0 (WCHAR_MIN) and WCHAR_MAX anyway. And there are no really
"interesting" ways to extend the meaning of these functions outside
this range.
Or do I miss something?

 
 #ifdef PEDANTIC_STANDARD
 wchar_t * wcschr(const wchar_t * s, wchar_t c);
 wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
 size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
 #else
 #define wcschr  _wcschr_32
 #define wcsrchr _wcsrchr_32
 #define wcrtomb _wcrtomb_32
 #endif



Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Marco Cimarosti

Antoine Leca wrote:
 Marco Cimarosti wrote:
  Actually, C does have different types for characters within
 strings and for
  characters in isolation.

 That is not my point of view.
 There is a special case for 'H', which has type int rather than char, for
 backward compatibility reasons (for example, because the first versions of C
 could not deal correctly with to-be-promoted arguments). Similarly, a
 number of (old) functions use int for their character arguments.
 Then, there is the point of view that int represents _either_ a valid
 character _or_ an error indication (EOF). This is why int is used as the
 return type of fgetc.

OK.

 Outside this, a string is clearly an array of characters, and
 characters are
 stored using the type char (or one of the sign alternatives).
 As a result,
 you can write 'H' either as such, or as "Hello, world!\n"[0].

OK.

  The type of a string literal (e.g. "Hello world!\n") is
 "array of char",
  while the type of a character literal (e.g. 'H') is "int".
 
  This distinction is generally reflected also in the C
 library, so that you
  don't get compiler warnings when passing character
 constants to functions.

 You need not, since C considers characters to be (small)
 integers, which eases
 the passing of arguments. This is unrelated to the issue.

OK. I was just describing the background.

  This distinction has been retained also in the newer "wide character
  library": "wchar_t" is the wide equivalent of "char", while
 "wint_t" is the
  wide equivalent of "int".

 Not exactly. The wide versions have the same distinction as
 the narrow ones for
 the second case above (finding errors), but not for the first
 one (promoting).

OK.

  The wide version of the examples above is:
 
  int fputws(const wchar_t * c, FILE * stream);
  wint_t fputwc(wint_t c, FILE * stream);
 ^^
 Instead, we have
   int fputws(const wchar_t * s, FILE * stream);
   wint_t fputwc(wchar_t c, FILE *stream);

 It shows clearly that c cannot hold the WEOF value. OTOH, the
 returned value
 _can_ be the error indication WEOF, so the type is wint_t.

Oops! Sorry.

I had two versions of "wchar.h" at hand: one (lovingly crafted by myself)
had «wchar_t c»; the other one (shipped with Microsoft Visual C++ 6.0) had:

_CRTIMP wint_t __cdecl fputwc(wint_t, FILE *);

I made the mistake of trusting the second one. :-)

 Similarly, the type of L'H' is wchar_t. You gave other
 examples in your "But".

  int iswalpha(wint_t c);

 Here, the iswalpha is intended to be able to test valid
 characters as well
 as the error indication, so the type is wint_t; here WEOF is
 specifically allowed.

OK.

  In an Unicode implementation of the "wide character
 library" (wchar.h and
  wctype.h), this difference may be exploited to use
 different UTF's for
  strings and characters:

 Ah, now we go into the interesting field.
 Please note that I left aside UTF-16, because I am not clear
 whether 16 bits are
 adequate to encode UTF-16 in wchar_t (in other words, whether
 wchar_t can be
 a multiwide encoding).

  typedef unsigned short wchar_t;
  /* UTF-16 character, used within string. */
 
  typedef unsigned long  wint_t;
  /* UTF-32 character, used for handling isolated characters. */

 To date, no problem.

  But, unluckily, there is a "but". Type "wchar_t" is *also*
 used for isolated
  character in a couple of stupid APIs:

 See above for another example: fputwc...

  But I think that changing those "wchar_t c" to "wint_t c"
 is a smaller
  "violence" to the standards than changing them to "const
 wchar_t * c".

 ;-)

OK, my trick is dirty as well, just a bit easier to hide. ;-)

  And you can also implement it in an elegant, quasi-standard way:
 corrected
  wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
  wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);
  size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);

 What is the point? You cannot pass to these anything other than values
 between 0 (WCHAR_MIN) and WCHAR_MAX anyway. And there are no really
 "interesting" ways to extend the meaning of these functions outside
 this range.
 Or do I miss something?

_wcschr_32 and _wcsrchr_32 would return a pointer to the first (or last)
occurrence of the specified character in the string, just like their
standard counterparts.

But if «c» is a supplementary character (above 0xFFFF), then it would be
represented in «s» (a UTF-16 string) by a surrogate pair, and the function
would thus return the address of the *high surrogate*.

E.g., assuming that «s» is «{0x2190, 0xD800, 0xDC05, 0x2192, 0x0000}» and
«c» is 0x10005, both functions would return «&s[1]»: the address of the high
surrogate 0xD800.

Similarly for _wcrtomb_32(): assuming that «s» points into a UTF-8 string,
the function would insert in «s» the multi-octet UTF-8 sequence corresponding
to «c» (four octets for a supplementary character such as 0x10005).
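
To make the intent concrete, here is a rough sketch of how such a
_wcschr_32() might look, assuming wchar_t holds UTF-16 code units and wint_t
holds a UTF-32 code point (hypothetical code, not taken from any existing
library; range checks omitted):

#include <stddef.h>
#include <wchar.h>

wchar_t * _wcschr_32(const wchar_t * s, wint_t c)
{
    if (c <= 0xFFFF) {
        /* BMP character: a single UTF-16 code unit (the terminator is
           findable too, as with the standard wcschr). */
        do {
            if (*s == (wchar_t)c)
                return (wchar_t *)s;
        } while (*s++ != L'\0');
        return NULL;
    } else {
        /* Supplementary character: search for its surrogate pair and
           return the address of the high surrogate. */
        wchar_t hi = (wchar_t)(0xD800 + ((c - 0x10000) >> 10));
        wchar_t lo = (wchar_t)(0xDC00 + ((c - 0x10000) & 0x3FF));
        for (; *s != L'\0'; s++)
            if (s[0] == hi && s[1] == lo)
                return (wchar_t *)s;
        return NULL;
    }
}

With the example above, _wcschr_32(s, 0x10005) would indeed return «&s[1]».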

  #ifdef PEDANTIC_STANDARD
  wchar_t * wcschr(const wchar_t * s, wchar_t c);
  wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
  size_t wcrtomb(char * s, wchar_t c, mbstate_t 

Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Michael \(michka\) Kaplan

From: "Marco Cimarosti" [EMAIL PROTECTED]

 the Surrogate (aka "Astral") Planes.

I believe the UTC has deprecated the term Astral planes with extreme
prejudice. HTH!

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/





Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread David Starner

On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka) Kaplan wrote:
 From: "Marco Cimarosti" [EMAIL PROTECTED]
 
  the Surrogate (aka "Astral") Planes.
 
 I believe the UTC has deprecated the term Astral planes with extreme
 prejudice. HTH!

The UTC has chosen not to use the term Astral Plane. Keeping that in mind,
I can choose to use whatever terms I want, realizing of course that some
may not get my point across. The UTC chose Surrogate Planes for perceived
functionality and translatability; I chose Astral Planes for perceived grace
and beauty. 

-- 
David Starner - [EMAIL PROTECTED]
http://dvdeug.dhis.org
Looking for a Debian developer in the Stillwater, Oklahoma area 
to sign my GPG key



Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Michael \(michka\) Kaplan

I think the issue is more one of the semantic baggage that terms like
astral, imaginary, irrational, or other such terms bring to the table.

Refusing to potentially insult the people who place importance on the
characters that will be encoded in planes other than the BMP is a thing of
grace and beauty (much more so than the insult would be!).

I think the UTC action is a responsible one.

(Just my two cents)

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "David Starner" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, November 20, 2000 7:18 AM
Subject: Re: string vs. char [was Re: Java and Unicode]


 On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka) Kaplan wrote:
  From: "Marco Cimarosti" [EMAIL PROTECTED]
 
   the Surrogate (aka "Astral") Planes.
 
  I believe the UTC has deprecated the term Astral planes with extreme
  prejudice. HTH!

 The UTC has chosen not to use the term Astral Plane. Keeping that in mind,
 I can choose to use whatever terms I want, realizing of course that some
 may not get my point across. The UTC chose Surrogate Planes for perceived
 functionality and translatability; I chose Astral Planes for perceived
grace
 and beauty.

 --
 David Starner - [EMAIL PROTECTED]
 http://dvdeug.dhis.org
 Looking for a Debian developer in the Stillwater, Oklahoma area
 to sign my GPG key





Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread John Cowan

David Starner wrote:
 I chose Astral Planes for perceived grace
 and beauty.

Thank you!

-- 
There is / one art   || John Cowan [EMAIL PROTECTED]
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



[totally OT] Unicode terminology (was Re: string vs. char [was Re: Java and Unicode])

2000-11-20 Thread Marco Cimarosti

David Starner wrote:
 Sent: 20 Nov 2000, Mon 16.18
 To: Unicode List
 Subject: Re: string vs. char [was Re: Java and Unicode]

 On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka)
 Kaplan wrote:
  From: "Marco Cimarosti" [EMAIL PROTECTED]
 
   the Surrogate (aka "Astral") Planes.
 
  I believe the UTC has deprecated the term Astral planes with extreme
  prejudice. HTH!

 The UTC has chosen not to use the term Astral Plane. Keeping
 that in mind,
 I can choose to use whatever terms I want, realizing of course
 that some
 may not get my point across. The UTC chose Surrogate Planes
 for perceived
 functionality and translatability; I chose Astral Planes for
 perceived grace and beauty.

Well, I am not as angrily pro "Astral Planes" as David is, but I too find
the humorous term prettier than the official one. And I used it because I
think that a few people on this list may still find it clearer than the
official "Surrogate Planes" -- which is more serious and descriptive, but
still relatively new to many.

Moreover, although my attitude towards the UTC (the "government" of Unicode)
is much more friendly than my attitude towards real governments out there
(if people like J. Jenkins or M. Davis were the President of the USA this
would be a much nicer world!), still I don't feel quite like obeying any
government's orders, prohibitions or deprecations without putting up due
resistance.

8-) (-- smiley wearing anti-tear-gas glasses)

_ Marco

__
La mia e-mail è ora: My e-mail is now:
   marco.cimarostiªeurope.com   
(Cambiare "ª" in "@")  (Change "ª" to "@")
 

__
FREE Personalized Email at Mail.com
Sign up at http://www.mail.com/?sr=signup



Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread addison

Hi Jani,

I dunno. I oversimplified in that statement about exposing vs. hiding.

ICU "hides" the facts about the Unicode implementation in macros,
specifically a next and previous character macro and various other
fillips. If you look very closely at the function (method) prototypes you
can see that, in fact, a "character" is a 32-bit entity and a string is
made (conditionally) of 16-bit entities. But, as you suggest, ICU makes it
easy to work with (and is set up so that a sufficiently motivated coder
could change the internal encoding).
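
For readers who have not looked inside ICU, the idea behind such a
next-character macro is roughly the following (my own sketch, not ICU's
actual API; the uchar16/uchar32 names are just placeholders):

#include <stddef.h>
#include <stdint.h>

typedef uint16_t uchar16;   /* UTF-16 code unit: the string storage type    */
typedef uint32_t uchar32;   /* UTF-32 code point: the single-character type */

/* Read one code point from a UTF-16 buffer and advance the index past
   one or two code units. Unpaired surrogates fall through unchanged. */
uchar32 next_codepoint(const uchar16 *s, size_t len, size_t *i)
{
    uchar32 c = s[(*i)++];

    if (c >= 0xD800 && c <= 0xDBFF && *i < len &&
        s[*i] >= 0xDC00 && s[*i] <= 0xDFFF)
        c = 0x10000 + ((c - 0xD800) << 10) + (s[(*i)++] - 0xDC00);

    return c;
}

A loop over a string then advances by character rather than by code unit,
without the caller needing to know how many units each character occupies.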

rant
If you ask 100 programmers what an index into a string means, they'll give you
the wrong answer 99 times... because there is little or no I18n training in
the course of becoming a programmer. The members of this list are
continually ground down by the sheer inertia of ignorance (I just gave up
answering one about email... I must have written a response to that
message a bunch of times, but don't have the time or stamina this morning
to go find and rework one of them).
/rant

In any case this has been a fun and instructive interlude. As I said in
my initial email, I tend to be a CONSUMER of Unicode APIs rather than a
creator. I haven't written a Unicode support package in quite some time
(and the last one was a UTF-8 hack in C++). It's good to be familiar with
the details, but I find that, as a programmer one typically doesn't fully
comprehend the design decisions until one faces them oneself. As it is, I
ended up changing my design and sample code over the weekend to follow the
suggestions of several on this list who've Been There.

As a side note: one of the problems I faced on this project was the need
to keep the Unicode and locale libraries extremely small (this is an
embedded OS). I would happily have borrowed ICU to actually *be* the
library... but it's too large. I've had to design a tiny (and therefore
quite limited) support library. It's been an interesting experience.

Best Regards,

Addison

===
Addison P. PhillipsPrincipal Consultant
Inter-Locale LLChttp://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering  Consulting Services

On Mon, 20 Nov 2000, Jani Kajala wrote:

 
  The question, I guess, boils down to: put it in the interface, or hide it
  in the internals. ICU exposes it. My spec, up to this point, hides it,
 
 (I'm aware that the original question was about C interfaces so you might consider 
this a bit off topic, but I just wanted to comment about the exposed encoding.)
 
 I think that exposing the encoding in interfaces doesn't do any good. It violates 
object-oriented design principles and it is not even intuitive.
 
 I'd bet that if we take 100 programmers and ask them 'What is this index in the 
context of this string?', in every case we'll get the answer that it is of course the 
nth character position. Nobody who isn't well aware of character encoding will ever 
think of code units. Thus, it is not intuitive to use indices to point at code units, 
especially as Unicode has been so well marketed as a '16-bit character set'.
 
 Besides, you can always use (C++ style) iterators instead of plain indices without 
any loss in performance or in syntactic convenience. By 'iterator' I here refer to a 
simple encapsulated pointer which behaves just like any C++ Standard Template Library 
random access iterator but takes the encoding into account. Example:
 
 for ( String::Iterator i = s.begin() ; i != s.end() ; ++i )
 // ith character in s = *i
 // i+nth character in s = i[n]
 
 The solution works with any encoding as long as string::iterator is defined properly.
 
 The conclusion that using indices won't make a difference in performance also makes 
sense if you consider the basic underlying task: if you need random access to a 
string, you need to check for characters spanning multiple code units. So the task 
has the same O(n) complexity; using indices won't help a bit. If the user needs 
access to an arbitrary character he needs to iterate anyway. It is just a matter of 
how you want to encapsulate the task.
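
 A rough sketch in plain C of that underlying O(n) task (a hypothetical
 helper, just to illustrate the argument above):

 #include <stddef.h>
 #include <stdint.h>

 /* Code-unit index where the n-th character of a UTF-16 buffer starts.
    A supplementary character occupies two 16-bit code units, so this
    is necessarily a linear scan. */
 size_t char_index(const uint16_t *s, size_t len, size_t n)
 {
     size_t i = 0;
     while (i < len && n > 0) {
         int pair = s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                    i + 1 < len &&
                    s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
         i += pair ? 2 : 1;
         n--;
     }
     return i;
 }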
 
 
 Regards,
 Jani Kajala
 http://www.helsinki.fi/~kajala/
 
 




Re: string vs. char [was Re: Java and Unicode]

2000-11-20 Thread Mark Davis

The UTC will be using the terms "supplementary code points", "supplementary
characters" and "supplementary planes". The term it is "deprecating with
extreme prejudice" is "surrogate characters".

See http://www.unicode.org/glossary/ for more information.

Mark

- Original Message -
From: "Michael (michka) Kaplan" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, November 20, 2000 06:54
Subject: Re: string vs. char [was Re: Java and Unicode]


 From: "Marco Cimarosti" [EMAIL PROTECTED]

  the Surrogate (aka "Astral") Planes.

 I believe the UTC has deprecated the term Astral planes with extreme
 prejudice. HTH!

 michka

 a new book on internationalization in VB at
 http://www.i18nWithVB.com/







RE: string vs. char [was Re: Java and Unicode]

2000-11-17 Thread Marco Cimarosti

Addison P. Phillips wrote:
 I ended up deciding that the Unicode API for this OS will only work in
 strings. CTYPE replacement functions (such as isalpha) and
 character based
 replacement functions (such as strchr) will take and return
 strings for
 all of their arguments.

 Internally, my functions are converting the pointed character to its
 scalar value (to look it up in the database most efficiently).

 This isn't very satisfying. It goes somewhat against the grain of 'C'
 programming. But it's equally unsatisfying to use a 32-bit
 representation
 for a character and a 16-bit representation for a string,
 because in 'C',
 a string *is* an array of characters. Which is more
 natural? Which is more common? Iterating across an array of
 16-bit values
 or

Actually, C does have different types for characters within strings and for
characters in isolation.

The type of a string literal (e.g. "Hello world!\n") is "array of char",
while the type of a character literal (e.g. 'H') is "int".

This distinction is generally reflected also in the C library, so that you
don't get compiler warnings when passing character constants to functions.

E.g., compare the following functions from stdio.h:

int fputs(const char * s, FILE * stream);
int fputc(int c, FILE * stream);

The same convention is generally used through the C library, not only in the
I/O functions. E.g.:

int isalpha(int c);
int tolower(int c);

This distinction has been retained also in the newer "wide character
library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
wide equivalent of "int".

The wide version of the examples above is:

int fputws(const wchar_t * c, FILE * stream);
wint_t fputwc(wint_t c, FILE * stream);

int iswalpha(wint_t c);
wint_t towlower(wint_t c);

In an Unicode implementation of the "wide character library" (wchar.h and
wctype.h), this difference may be exploited to use different UTF's for
strings and characters:

typedef unsigned short wchar_t;
/* UTF-16 character, used within string. */

typedef unsigned long  wint_t;
/* UTF-32 character, used for handling isolated characters. */

But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated
character in a couple of stupid APIs:

wchar_t * wcschr(const wchar_t * s, wchar_t c);
wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);

BTW, the blunder in wcschr() and wcsrchr() is inherited from their "narrow"
ancestors: strchr() and strrchr().

But I think that changing those "wchar_t c" to "wint_t c" is a smaller
"violence" to the standards than changing them to "const wchar_t * c".
And you can also implement it in an elegant, quasi-standard way:

wchar_t * _wcschr_32(const wint_t * s, wchar_t c);
wchar_t * _wcsrchr_32(const wint_t * s, wchar_t c);
size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);

#ifdef PEDANTIC_STANDARD
wchar_t * wcschr(const wchar_t * s, wchar_t c);
wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
#else
#define wcschr  _wcschr_32
#define wcsrchr _wcsrchr_32
#define wcrtomb _wcrtomb_32
#endif

I would like to see the opinion of C standardization experts (e.g. A. Leca)
about this forcing of the C standard.

_ Marco.

__
La mia e-mail è ora: My e-mail is now:
   marco.cimarostiªeurope.com   
(Cambiare "ª" in "@")  (Change "ª" to "@")
 

__
FREE Personalized Email at Mail.com
Sign up at http://www.mail.com/?sr=signup



RE: string vs. char [was Re: Java and Unicode]

2000-11-17 Thread Marco Cimarosti

Ooops!

In my previous message, I wrote:

 wchar_t * _wcschr_32(const wint_t * s, wchar_t c);
 wchar_t * _wcsrchr_32(const wint_t * s, wchar_t c);

What I actually wanted to write is:

wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);

Sorry if this puzzled you.

_ Marco

__
La mia e-mail è ora: My e-mail is now:
   marco.cimarostiªeurope.com   
(Cambiare "ª" in "@")  (Change "ª" to "@")
 

__
FREE Personalized Email at Mail.com
Sign up at http://www.mail.com/?sr=signup



Re: string vs. char [was Re: Java and Unicode]

2000-11-17 Thread addison

Thanks Mark. I've looked extensively at the ICU code in doing much of the
design on this system. What my email didn't end up saying was, basically,
that the "char" functions end up decoding a scalar value internally in a
32-bit integer value.

The question, I guess, boils down to: put it in the interface, or hide it
in the internals. ICU exposes it. My spec, up to this point, hides it,
because I think that programmers will be working with strings more often
than with individual characters and that perhaps this will seem more
"natural".

Addison

===
Addison P. PhillipsPrincipal Consultant
Inter-Locale LLChttp://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering  Consulting Services

On Thu, 16 Nov 2000, Mark Davis wrote:

 We have found that it works pretty well to have a uchar32 datatype, with
 uchar16 storage in strings. In ICU (C version) we use macros for efficient
 access; in ICU (C++) version we use method calls, and for ICU (Java version)
 we have a set of utility static methods (since we can't add to the Java
 String API).
 
 With these functions, the number of changes that you have to make to
 existing code is fairly small, and you don't have to change the way that
 loops are set up, for example.
 
 Mark
 
 - Original Message -
 From: [EMAIL PROTECTED]
 To: "Unicode List" [EMAIL PROTECTED]
 Sent: Thursday, November 16, 2000 13:24
 Subject: string vs. char [was Re: Java and Unicode]
 
 
  Normally this thread would be of only academic interest to me...
 
  ...but this week I'm writing a spec for adding Unicode support to an
  embedded operating system written in C. Due to Mssrs. O'Conner and
  Scherer's presentations at the most recent IUC, I was aware of the clash
  between internal string representations and the Unicode Scalar Value
  necessary for efficient lookup.
 
  Now I'm getting alarmed about the solution I've selected.
 
  The OS I'm working on is written in C. I considered, therefore, using
  UTF-8 as the internal Unicode representation (because I don't have the
  option of #defining Unicode and using wchar), but the storage expansion
  and the fact that several existing modules grok UTF-16 (well, UCS-2), led
  me to go in the direction of UTF-16.
 
  I also considered supporting only UCS-2. It's a bad bad bad idea, but it
  gets me out of the following:
 
  I ended up deciding that the Unicode API for this OS will only work in
  strings. CTYPE replacement functions (such as isalpha) and character based
  replacement functions (such as strchr) will take and return strings for
  all of their arguments.
 
  Internally, my functions are converting the pointed character to its
  scalar value (to look it up in the database most efficiently).
 
  This isn't very satisfying. It goes somewhat against the grain of 'C'
  programming. But it's equally unsatisfying to use a 32-bit representation
  for a character and a 16-bit representation for a string, because in 'C',
  a string *is* an array of characters. Which is more
  natural? Which is more common? Iterating across an array of 16-bit values
  or
 
  ===
  Addison P. PhillipsPrincipal Consultant
  Inter-Locale LLChttp://www.inter-locale.com
  Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]
 
  +1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
  ===
  Globalization Engineering  Consulting Services
 
 
 
 
 




RE: string vs. char [was Re: Java and Unicode]

2000-11-17 Thread addison

Well... I think you're right. I knew that char and string units weren't
really the same thing. My concern was how to make it easy on developers to
use the Unicode API using their "native intelligence".

More thought makes me less certain of my approach. Specifically, as Mark
points out, looping structures are much more ugly when using
pointers. And I still have to have all of the code for the scalar value
conversion. It's better to force a casting macro on the developers than
trying to do their dirty work for them. Fooling developers who use your
API is a Bad Idea, usually.

Thanks again for the feedback.

Addison

===
Addison P. PhillipsPrincipal Consultant
Inter-Locale LLChttp://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering  Consulting Services

On Fri, 17 Nov 2000, Marco Cimarosti wrote:

 Addison P. Phillips wrote:
  I ended up deciding that the Unicode API for this OS will only work in
  strings. CTYPE replacement functions (such as isalpha) and
  character based
  replacement functions (such as strchr) will take and return
  strings for
  all of their arguments.
 
  Internally, my functions are converting the pointed character to its
  scalar value (to look it up in the database most efficiently).
 
  This isn't very satisfying. It goes somewhat against the grain of 'C'
  programming. But it's equally unsatisfying to use a 32-bit
  representation
  for a character and a 16-bit representation for a string,
  because in 'C',
  a string *is* an array of characters. Which is more
  natural? Which is more common? Iterating across an array of
  16-bit values
  or
 
 Actually, C does have different types for characters within strings and for
 characters in isolation.
 
 The type of a string literal (e.g. "Hello world!\n") is "array of char",
 while the type of a character literal (e.g. 'H') is "int".
 
 This distinction is generally reflected also in the C library, so that you
 don't get compiler warnings when passing character constants to functions.
 
 E.g., compare the following functions from stdio.h:
 
 int fputs(const char * s, FILE * stream);
 int fputc(int c, FILE * stream);
 
 The same convention is generally used through the C library, not only in the
 I/O functions. E.g.:
 
 int isalpha(int c);
 int tolower(int c);
 
 This distinction has been retained also in the newer "wide character
 library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
 wide equivalent of "int".
 
 The wide version of the examples above is:
 
 int fputws(const wchar_t * c, FILE * stream);
 wint_t fputwc(wint_t c, FILE * stream);
 
 int iswalpha(wint_t c);
 wint_t towlower(wint_t c);
 
 In an Unicode implementation of the "wide character library" (wchar.h and
 wctype.h), this difference may be exploited to use different UTF's for
 strings and characters:
 
 typedef unsigned short wchar_t;
 /* UTF-16 character, used within string. */
 
 typedef unsigned long  wint_t;
 /* UTF-32 character, used for handling isolated characters. */
 
 But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated
 character in a couple of stupid APIs:
 
 wchar_t * wcschr(const wchar_t * s, wchar_t c);
 wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
 size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
 
 BTW, the blunder in wcschr() and wcsrchr() is inherited from their "narrow"
 ancestors: strchr() and strrchr().
 
 But I think that changing those "wchar_t c" to "wint_t c" is a smaller
 "violence" to the standards than changing them to "const wchar_t * c".
 And you can also implement it in an elegant, quasi-standard way:
 
 wchar_t * _wcschr_32(const wint_t * s, wchar_t c);
 wchar_t * _wcsrchr_32(const wint_t * s, wchar_t c);
 size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);
 
 #ifdef PEDANTIC_STANDARD
 wchar_t * wcschr(const wchar_t * s, wchar_t c);
 wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
 size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
 #else
 #define wcschr  _wcschr_32
 #define wcsrchr _wcsrchr_32
 #define wcrtomb _wcrtomb_32
 #endif
 
 I would like to see the opinion of C standardization experts (e.g. A. Leca)
 about this forcing of the C standard.
 
 _ Marco.
 
 __
 La mia e-mail è ora: My e-mail is now:
marco.cimarostiªeurope.com   
 (Cambiare "ª" in "@")  (Change "ª" to "@")
  
 
 __
 FREE Personalized Email at Mail.com
 Sign up at http://www.mail.com/?sr=signup
 




Re: Java and Unicode

2000-11-16 Thread Elliotte Rusty Harold

At 4:44 PM -0800 11/15/00, Markus Scherer wrote:

In the case of Java, the equivalent course of action would be to 
stick with a 16-bit char as the base type for strings. The int type 
could be used in _additional_ APIs for single Unicode code points, 
deprecating the old APIs with char.


It's not quite that simple. Many of the key APIs in Java already use 
ints instead of chars where chars are expected. In particular, the 
Reader and Writer classes in java.io do this.

I do agree that it makes sense to use strings rather than characters. 
I'm just wondering how bad the transition is going to be. Could we 
get away with eliminating (or at least deprecating) the char data 
type completely and all methods that use it? And can we do that 
without breaking all existing code and redesigning the language?

For example, consider the charAt() method in java.lang.String:

public char charAt(int index)

This method is used to walk strings, looking at each character in 
turn, a useful thing to do. Clearly it would be possible to replace 
it with a method with a String return type like this:

public String charAt(int index)

The returned string would contain a single character (which might be 
composed of two surrogate chars). However, we can't simply add that 
method because Java can't overload on return type. So we have to give 
that method a new name like:

public String characterAt(int index)

OK. That one's not too bad, maybe even more intelligible than what 
we're replacing. But we have to do this in hundreds of places in the 
API!  Some will be much worse than this.  Is it really going to be 
possible to make this sort of change everywhere? Or is it time to 
bite the bullet and break backwards compatibility? Or should we 
simply admit that non-BMP characters aren't that important and stick 
with the current API?  Or perhaps provide special classes that handle 
non-BMP characters as an ugly-bolt-on to the language that will be 
used by a few Unicode aficionados but ignored by most programmers, 
just like wchar is ignored in C to this day?

None of these solutions are attractive. It may take the next 
post-Java language to really solve them.
-- 

+---++---+
| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
+---++---+
|  The XML Bible (IDG Books, 1999)   |
|  http://metalab.unc.edu/xml/books/bible/   |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+--+-+
|  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/ |
+--+-+



Re: Java and Unicode

2000-11-16 Thread Valeriy E. Ushakov

On Thu, Nov 16, 2000 at 05:58:27 -0800, Elliotte Rusty Harold wrote:

 public char charAt(int index)
 
 This method is used to walk strings, looking at each character in 
 turn, a useful thing to do. Clearly it would be possible to replace 
 it with a method with a String return type like this:
 
 public String characterAt(int index)

And what method you will use to obtain the (single) character in the
returned string?  :-)

SY, Uwe
-- 
[EMAIL PROTECTED] |   Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/|   Ist zu Grunde gehen



Re: Java and Unicode

2000-11-16 Thread Elliotte Rusty Harold

At 7:26 AM -0800 11/16/00, Valeriy E. Ushakov wrote:
On Thu, Nov 16, 2000 at 05:58:27 -0800, Elliotte Rusty Harold wrote:

  public char charAt(int index)

  This method is used to walk strings, looking at each character in
  turn, a useful thing to do. Clearly it would be possible to replace
  it with a method with a String return type like this:

  public String characterAt(int index)

And what method you will use to obtain the (single) character in the
returned string?  :-)

The point is you don't need this. You would always work with strings 
and never with chars. The API would be changed so that the char data 
type could be fully deprecated.

Alternate approach: define a new Char class that could be used in 
place of char everywhere a character type was needed.
-- 

+---++---+
| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
+---++---+
|  The XML Bible (IDG Books, 1999)   |
|  http://metalab.unc.edu/xml/books/bible/   |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+--+-+
|  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/ |
+--+-+



Re: Java and Unicode

2000-11-16 Thread Thomas Chan

On Thu, 16 Nov 2000, Markus Scherer wrote:

 The ICU API was changed this way within a few months this year. Some of the 
higher-level implementations are still to follow until next summer, when there will 
be some 45000 CJK characters that will be infrequent but hard to ignore - the Chinese 
and Japanese governments will insist on their support.

I hope support comes soon, too, as many CJK characters commonly used in
writing the Yue (Cantonese) language wound up in CJK Extension B, such as
U+28319 lip1 'elevator/lift'.  Until then, one'll have to use a legacy
Big5-HKSCS.


Thomas Chan
[EMAIL PROTECTED]





Re: Java and Unicode

2000-11-16 Thread Markus Scherer

Juliusz Chroboczek wrote:
 I believe that Java strings use UTF-8 internally.

.class files use a _modified_ utf-8. at runtime, strings are always in 16-bit unicode.

  At any rate the
 internal implementation is not exposed to applications -- note that
 `length' is a method in class String (while it is a field in vector
 classes).

but length() and charAt() are some of the apis that expose that the internal 
representation is in 16-bit unicode, at least semantically. length() counts 16-bit 
units from ucs-2/utf-16, not bytes from utf-8 or code points from utf-32. all charAt() 
and substring() etc. behave like that.

markus



string vs. char [was Re: Java and Unicode]

2000-11-16 Thread addison

Normally this thread would be of only academic interest to me...

...but this week I'm writing a spec for adding Unicode support to an
embedded operating system written in C. Due to Mssrs. O'Conner and
Scherer's presentations at the most recent IUC, I was aware of the clash
between internal string representations and the Unicode Scalar Value
necessary for efficient lookup.

Now I'm getting alarmed about the solution I've selected.

The OS I'm working on is written in C. I considered, therefore, using
UTF-8 as the internal Unicode representation (because I don't have the
option of #defining Unicode and using wchar), but the storage expansion
and the fact that several existing modules grok UTF-16 (well, UCS-2), led
me to go in the direction of UTF-16.

I also considered supporting only UCS-2. It's a bad bad bad idea, but it
gets me out of the following:

I ended up deciding that the Unicode API for this OS will only work in
strings. CTYPE replacement functions (such as isalpha) and character based
replacement functions (such as strchr) will take and return strings for
all of their arguments.

Internally, my functions are converting the pointed character to its
scalar value (to look it up in the database most efficiently).

This isn't very satisfying. It goes somewhat against the grain of 'C'
programming. But it's equally unsatisfying to use a 32-bit representation
for a character and a 16-bit representation for a string, because in 'C',
a string *is* an array of characters. Which is more
natural? Which is more common? Iterating across an array of 16-bit values
or 

===
Addison P. PhillipsPrincipal Consultant
Inter-Locale LLChttp://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering  Consulting Services





Re: string vs. char [was Re: Java and Unicode]

2000-11-16 Thread Mark Davis

We have found that it works pretty well to have a uchar32 datatype, with
uchar16 storage in strings. In ICU (C version) we use macros for efficient
access; in ICU (C++) version we use method calls, and for ICU (Java version)
we have a set of utility static methods (since we can't add to the Java
String API).

With these functions, the number of changes that you have to make to
existing code is fairly small, and you don't have to change the way that
loops are set up, for example.

Mark

- Original Message -
From: [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Thursday, November 16, 2000 13:24
Subject: string vs. char [was Re: Java and Unicode]


 Normally this thread would be of only academic interest to me...

 ...but this week I'm writing a spec for adding Unicode support to an
 embedded operating system written in C. Due to Mssrs. O'Conner and
 Scherer's presentations at the most recent IUC, I was aware of the clash
 between internal string representations and the Unicode Scalar Value
 necessary for efficient lookup.

 Now I'm getting alarmed about the solution I've selected.

 The OS I'm working on is written in C. I considered, therefore, using
 UTF-8 as the internal Unicode representation (because I don't have the
 option of #defining Unicode and using wchar), but the storage expansion
 and the fact that several existing modules grok UTF-16 (well, UCS-2), led
 me to go in the direction of UTF-16.

 I also considered supporting only UCS-2. It's a bad bad bad idea, but it
 gets me out of the following:

 I ended up deciding that the Unicode API for this OS will only work in
 strings. CTYPE replacement functions (such as isalpha) and character based
 replacement functions (such as strchr) will take and return strings for
 all of their arguments.

 Internally, my functions are converting the pointed character to its
 scalar value (to look it up in the database most efficiently).

 This isn't very satisfying. It goes somewhat against the grain of 'C'
 programming. But it's equally unsatisfying to use a 32-bit representation
 for a character and a 16-bit representation for a string, because in 'C',
 a string *is* an array of characters. Which is more
 natural? Which is more common? Iterating across an array of 16-bit values
 or

 ===
 Addison P. PhillipsPrincipal Consultant
 Inter-Locale LLChttp://www.inter-locale.com
 Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

 +1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
 ===
 Globalization Engineering  Consulting Services







Re: Java and Unicode

2000-11-15 Thread Elliotte Rusty Harold

One thing I'm very curious about going forward: Right now character 
values greater than 65535 are purely theoretical. However this will 
change. It seems to me that handling these characters properly is 
going to require redefining the char data type from two bytes to 
four. This is a major incompatible change with existing Java.

There are a number of possibilities that don't break backwards 
compatibility (making trans-BMP characters require two chars rather 
than one, defining a new wchar primitive data type that is 4-bytes 
long as well as the old 2-byte char type, etc.) but they all make the 
language a lot less clean and obvious. In fact, they all more or less 
make Java feel like C and C++ feel when working with Unicode: like 
something new has been bolted on after the fact, and it doesn't 
really fit the old design.

Are there any plans for handling this?
-- 

+---++---+
| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
+---++---+
|  The XML Bible (IDG Books, 1999)   |
|  http://metalab.unc.edu/xml/books/bible/   |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+--+-+
|  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/ |
+--+-+



Re: Java and Unicode

2000-11-15 Thread Michael \(michka\) Kaplan

I do not think they are so theoretical, with both 10646 and Unicode
including them in the very near future (unless you count it as theoretical
when you drop an egg but it has not yet hit the ground!).

In any case, I think that UTF-16 is the answer here.

Many people try to compare this to DBCS, but it really is not the same
thing; understanding lead bytes and trail bytes in DBCS is *astoundingly*
more complicated than handling surrogate pairs.

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "Elliotte Rusty Harold" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, November 15, 2000 6:15 AM
Subject: Re: Java and Unicode


 One thing I'm very curious about going forward: Right now character
 values greater than 65535 are purely theoretical. However this will
 change. It seems to me that handling these characters properly is
 going to require redefining the char data type from two bytes to
 four. This is a major incompatible change with existing Java.

 There are a number of possibilities that don't break backwards
 compatibility (making trans-BMP characters require two chars rather
 than one, defining a new wchar primitive data type that is 4-bytes
 long as well as the old 2-byte char type, etc.) but they all make the
 language a lot less clean and obvious. In fact, they all more or less
 make Java feel like C and C++ feel when working with Unicode: like
 something new has been bolted on after the fact, and it doesn't
 really fit the old design.

 Are there any plans for handling this?
 --

 +---++---+
 | Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
 +---++---+
 |  The XML Bible (IDG Books, 1999)   |
 |  http://metalab.unc.edu/xml/books/bible/   |
 |   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
 +--+-+
 |  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
 |  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/ |
 +--+-+





RE: Java and Unicode

2000-11-15 Thread Marco . Cimarosti

Elliotte Rusty Harold wrote:

 One thing I'm very curious about going forward: Right now character 
 values greater than 65535 are purely theoretical. However this will 
 change. It seems to me that handling these characters properly is 
 going to require redefining the char data type from two bytes to 
 four. This is a major incompatible change with existing Java.
 (...)

John O'Conner just wrote something about surrogates
(http://www.unicode.org/unicode/faq/utf_bom.html#16) and UTF-16
(http://www.unicode.org/unicode/faq/utf_bom.html#5) in Java, but your
message was probably already on its way:

 You can currently store UTF-16 in the String and StringBuffer 
 classes. However,
 all operations are on char values or 16-bit code units. The 
 upcoming release of
 the J2SE platform will include support for Unicode 3.0 (maybe 3.0.1)
 properties, case mapping, collation, and character break 
 iteration. There is no
 explicit support for surrogate pairs in Unicode at this time, 
 although you can
 certainly find out if a code unit is a surrogate unit.
 
 In the future, as characters beyond 0xFFFF become more 
 important, you can
 expect that more robust, official support will follow.
 
 -- John O'Conner

_ Marco



Re: Java and Unicode

2000-11-15 Thread Jungshik Shin

On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:

 In any case, I think that UTF-16 is the answer here.
 
 Many people try to compare this to DBCS, but it really is not the same
 thing; understanding lead bytes and trail bytes in DBCS is *astoundingly*
 more complicated than handling surrogate pairs.

Well, it depends on what multibyte encoding you're talking about. In case
of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to
SJIS(Windows94?), Windows-949(UHC), Windows-950,  WIndows-125x(JOHAB),
ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN , it's not that hard (about
the same as UTF-16, I believe, especially in case of   EUC-CN and EUC-KR)

Jungshik Shin




Re: Java and Unicode

2000-11-15 Thread Thomas Chan

On Wed, 15 Nov 2000, Jungshik Shin wrote:

 On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
  In any case, I think that UTF-16 is the answer here.
  
  Many people try to compare this to DBCS, but it really is not the same
  thing; understanding lead bytes and trail bytes in DBCS is *astoundingly*
  more complicated than handling surrogate pairs.
 
 Well, it depends on what multibyte encoding you're talking about. In case
 of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to
  SJIS(Windows94?), Windows-949(UHC), Windows-950,  Windows-125x(JOHAB),
 ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN , it's not that hard (about
 the same as UTF-16, I believe, especially in case of   EUC-CN and EUC-KR)

I would move EUC-JP and EUC-TW, and possibly EUC-KR (if you use more than
KS X 1001 in it) to the "complicated" group because of the shifting bytes
required to get to different planes/character sets.


Thomas Chan
[EMAIL PROTECTED]





Re: Java and Unicode

2000-11-15 Thread Roozbeh Pournader



On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:

  I do not think they are so theoretical, with both 10646 and Unicode
  including them in the very near future (unless you count it as theoretical
  when you drop an egg but it has not yet hit the ground!).

Lemme think. You're saying that when I have not even seen a single egg
hitting the ground, I should believe that it will hit some day? ;)





Re: Java and Unicode

2000-11-15 Thread Jungshik Shin

On Wed, 15 Nov 2000, Thomas Chan wrote:

 On Wed, 15 Nov 2000, Jungshik Shin wrote:
 
  On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
   
   Many people try to compare this to DBCS, but it really is not the same
   thing; understanding lead bytes and trail bytes in DBCS is *astoundingly*
   more complicated than handling surrogate pairs.
  
  Well, it depends on what multibyte encoding you're talking about. In case
  of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to
   SJIS(Windows94?), Windows-949(UHC), Windows-950,  Windows-125x(JOHAB),
  ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN , it's not that hard (about
  the same as UTF-16, I believe, especially in case of   EUC-CN and EUC-KR)
 
 I would move EUC-JP and EUC-TW, and possibly EUC-KR (if you use more than
 KS X 1001 in it) to the "complicated" group because of the shifting bytes
 required to get to different planes/character sets.

Well, EUC-KR has never used character sets other than US-ASCII (or
its Korean variant KS X 1003) and KS X 1001, although a theoretical
possibility is there. A more realistic complication for EUC-KR (although
very rarely used; there are only two known implementations: Hanterm, the
Korean xterm, and Mozilla) arises not from a third character set (KS X
1002) in EUC-KR but from the 8-byte sequence representation of the
(11172-2350) Hangul syllables not covered by the repertoire of KS X 1001.

As for EUC-JP (which uses JIS X 0201/US-ASCII, JIS X 0208 and JIS X 0212)
and EUC-TW, I know what you're saying. That's exactly why I added at
the end of my prev. message 'especially in case of EUC-CN and EUC-KR'
:-) Probably, I should have written that among multibyte encodings at least
EUC-CN and EUC-KR are as easy to handle as UTF-16.

Jungshik Shin




Re: Java and Unicode

2000-11-15 Thread John Jenkins


On Wednesday, November 15, 2000, at 12:08 PM, Roozbeh Pournader wrote:

 
 
 On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote:
 
   I do not think they are so theoretical, with both 10646 and Unicode
   including them in the very near future (unless you count it as 
  theoretical
   when you drop an egg but it has not yet hit the ground!).
 
 Lemme think. You're saying that when I have not even seen a single egg
 hitting the ground, I should believe that it will hit some day? ;)
 
 

Well, you should be expecting about 45,000 eggs within the next six months.  




Re: Java and Unicode

2000-11-15 Thread Kenneth Whistler

John O'Conner wrote:

 Yes. If you have been involved with Unicode for any period of time at all, you
 would know that the Unicode consortium has advertised Unicode's 16-bit
 encoding for a long, long time, even in its latest Unicode 3.0 spec. The
 Unicode 3.0 spec clearly favors the 16-bit encoding of Unicode code units, and
 the design chapter (chapter 2) never even hints at a 32-bit encoding form.

Indeed. Though, to be fair, people have been talking about UCS-4 and
then UTF-32 for quite awhile now, and the UTF-32 Technical Report has been
approved for half a year.

FYI, on November 9, the Unicode Technical Committee officially voted
to make Unicode Technical Report #19 "UTF-32" a Unicode Standard Annex (UAX).
This will be effective with the rollout of the Unicode Standard, Version
3.1, and will make the 32-bit transformation format a coequal partner
with UTF-16 and UTF-8 as sanctioned Unicode encoding forms.

 
 The previous 2.0 spec (and previous specs as well) promoted this 16-bit
 encoding too...and even claimed that Unicode was a 16-bit, "fixed-width",
 coded character set. There are lots of reasons why Java's char is a 16-bit
 value...the fact that the Unicode Consortium itself has promoted and defined
 Unicode as a 16-bit coded character set for so long is probably the biggest.

It is easy to look back from the year 2000 and wonder why.

But it is also important to remember the context of 1989-1991. During
that time frame, the loudest complaints were from those who were
proclaiming that Unicode's move from 8-bit to 16-bit characters would
break all software, choke the databases, inflate all documents by
a factor of two, and generally end the world as we knew it.

As it turns out, they were wrong on all counts. But the rhetorical
structure of the Unicode Standard was initially set up to be a hard
sell for 16-bit characters *as opposed to* 8-bit characters.

The implementation world has moved on. Now we have an encoding model
for Unicode that embraces an 8-bit, a 16-bit, *and* a 32-bit encoding
form, while acknowledging that the character encoding per se is
effectively 21 bits. This is more complicated than we hoped for
originally, of course, but I think most of us agree that the incremental
complexity in encoding forms is a price we are willing to pay in order
to have a single character encoding standard that can interoperate
in 8-, 16-, and 32-bit environments.

--Ken





Java and Unicode

2000-11-14 Thread Jani Kajala

As Unicode will soon contain characters defined beyond the code point range
[0,65535], I'm wondering how Java is going to handle this.

I didn't find any hints in the JDK documentation either; at least, a few days
ago when I browsed the Java documentation about internationalization I just
saw a comment that 'Unicode is a 16-bit encoding.' (two errors in one
sentence)


Regards,
Jani Kajala





Re: Java and Unicode

2000-11-14 Thread John O'Conner

You can currently store UTF-16 in the String and StringBuffer classes. However,
all operations are on char values or 16-bit code units. The upcoming release of
the J2SE platform will include support for Unicode 3.0 (maybe 3.0.1)
properties, case mapping, collation, and character break iteration. There is no
explicit support for surrogate pairs in Unicode at this time, although you can
certainly find out if a code unit is a surrogate unit.

In the future, as characters beyond 0xFFFF become more important, you can
expect that more robust, official support will follow.

-- John O'Conner

Jani Kajala wrote:

 As Unicode will soon contain characters defined beyond the code point range
 [0,65535], I'm wondering how Java is going to handle this.

 I didn't find any hints in the JDK documentation either; at least, a few days
 ago when I browsed the Java documentation about internationalization I just
 saw a comment that 'Unicode is a 16-bit encoding.' (two errors in one
 sentence)

 Regards,
 Jani Kajala




RE: Java, SQL, Unicode and Databases

2000-06-23 Thread Michael Kaplan (Trigeminal Inc.)

Microsoft is very COM-based for its actual data access methods and COM
uses BSTRs that are BOM-less UTF-16. Because of that, the actual storage
format of any database ends up irrelevant since it will be converted to
UTF-16 anyway.

Given that this is what the data layers do, performance is certainly better
if there does not have to be an extra call to the Windows
MultiByteToWideChar to convert UTF-8 to UTF-16. So from a Windows
perspective, not only is it no trouble, but it is also the best possible
solution!
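
For reference, the conversion in question is the usual two-call pattern with
MultiByteToWideChar (a simplified sketch; error handling and the allocation
strategy are reduced to a minimum):

#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated UTF-8 string to UTF-16 (WCHAR). The caller
   frees the result with free(); returns NULL on failure. */
WCHAR *utf8_to_utf16(const char *utf8)
{
    WCHAR *wide;
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);

    if (n <= 0)
        return NULL;
    wide = (WCHAR *)malloc(n * sizeof(WCHAR));
    if (wide != NULL)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n);
    return wide;
}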

In any case, I know plenty of web people who *do* encode their strings in
SQL Server databases as UTF-8 for web applications, since UTF-8 is their
preference. They are willing to take the hit of "converting themselves"
because when data is being read it is faster to go through no conversions at
all.

Michael

 --
 From: [EMAIL PROTECTED][SMTP:[EMAIL PROTECTED]]
 Sent: Friday, June 23, 2000 7:55 AM
 To:   Unicode List
 Cc:   Unicode List; [EMAIL PROTECTED]
 Subject:  Re: Java, SQL, Unicode and Databases
 
 
 
 I think that this is also true for DB2 using UTF-8 as the database
 encoding.
 From an application perspective, MS SQL Server is the one that gives us
 the most
 trouble, because it doesn't support UTF-8 as a database encoding for char,
 etc.
 Joe
 
 Kenneth Whistler [EMAIL PROTECTED] on 06/22/2000 06:42:20 PM
 
 To:   "Unicode List" [EMAIL PROTECTED]
 cc:   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe
 Ross/Tivoli
   Systems)
 Subject:  Re: Java, SQL, Unicode and Databases
 
 
 
 
 Jianping responded:
 
 
  Tex,
 
  Oracle doesn't have special requirement for datatype in JDBC driver if
 you use
 UTF8 as database
  character set. In this case, all the text datatype in JDBC will support
 Unicode data.
 
 
 The same thing is, of course, true for Sybase databases using UTF-8
 at the database character set, accessing them through a JDBC driver.
 
 But I think Tex's question is aimed at the much murkier area
 of what the various database vendors' strategies are for dealing
 with UTF-16 Unicode as a datatype. In that area, the answers for
 what a cross-platform application vendor needs to do and for how
 JDBC drivers might abstract differences in database implementations
 are still unclear.
 
 --Ken
 
 
 



Re: Java, SQL, Unicode and Databases

2000-06-23 Thread Joe_Ross



Yes, version 7. It requires us to use a different data type (nchar) if we want
to store multilingual text as UTF-16. We want our applications to be database
vendor independent so that customers can use any database under the covers. If
all databases supported UTF-8 as an encoding for char, we could support
multilingual data in the same way for all vendors. As it is, we have to use a
different schema for MS SQL Server than we do for the others.
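
Joe's "different schema" point in a small, hedged sketch (the vendor strings
and the column are invented): only the DDL has to know about the vendor, with
NVARCHAR on SQL Server 7 and plain VARCHAR where a UTF-8 database character
set can be assumed.

  public class SchemaPerVendor {
      // Hypothetical: pick the column DDL per vendor.
      static String descriptionColumn(String vendor) {
          if (vendor.equals("sqlserver")) {
              return "description NVARCHAR(255)";   // UTF-16 storage on SQL Server 7
          }
          return "description VARCHAR(255)";        // assumes a UTF-8 database character set
      }

      public static void main(String[] args) {
          System.out.println("CREATE TABLE item (id INT, " + descriptionColumn("sqlserver") + ")");
          System.out.println("CREATE TABLE item (id INT, " + descriptionColumn("oracle") + ")");
      }
  }
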
Joe


"Tex Texin" [EMAIL PROTECTED] on 06/23/2000 11:50:06 AM

To:   Joe Ross/Tivoli Systems@Tivoli Systems
cc:   Unicode List [EMAIL PROTECTED], Hossein Kushki@IBMCA, Vladimir Dvorkin
  [EMAIL PROTECTED], Steven Watt [EMAIL PROTECTED]
Subject:  Re: Java, SQL, Unicode and Databases




Joe,

Can you expand on this a bit more? Privately if you prefer.
Do you mean version 7 of MS SQL Server?

I assume that if it doesn't have UTF-8, it uses UTF-16. How does having this
as the storage encoding become problematic?
tex


[EMAIL PROTECTED] wrote:

 I think that this is also true for DB2 using UTF-8 as the database encoding.
 From an application perspective, MS SQL Server is the one that gives us the
 most trouble, because it doesn't support UTF-8 as a database encoding for
 char, etc.
 Joe

 Kenneth Whistler [EMAIL PROTECTED] on 06/22/2000 06:42:20 PM

 To:   "Unicode List" [EMAIL PROTECTED]
 cc:   [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Joe
Ross/Tivoli
   Systems)
 Subject:  Re: Java, SQL, Unicode and Databases

 Jianping responded:

 
  Tex,
 
  Oracle doesn't have any special datatype requirement in the JDBC driver if
  you use UTF8 as the database character set. In this case, all the text
  datatypes in JDBC will support Unicode data.
 

 The same thing is, of course, true for Sybase databases using UTF-8
 as the database character set, accessing them through a JDBC driver.

 But I think Tex's question is aimed at the much murkier area
 of what the various database vendors' strategies are for dealing
 with UTF-16 Unicode as a datatype. In that area, the answers for
 what a cross-platform application vendor needs to do and for how
 JDBC drivers might abstract differences in database implementations
 are still unclear.

 --Ken







RE: Java, SQL, Unicode and Databases

2000-06-23 Thread Michael Kaplan (Trigeminal Inc.)

The datatype *does* matter in the sense that you would use UTF-16 data
fields (NTEXT, NCHAR and NVARCHAR) and access them with your favorite data
access method, which will convert as needed to whatever format it uses. You
will never know or care what the underlying engine stores.

The web site stuff will not work for you since you would have to do the
extra conversions to do the data mining, so you would probably go with plan
"A".

My general point is that OLE DB to an Oracle UTF-8 field and to a SQL Server
UTF-16 field both return the same type of data: UTF-16. So COM in this
case is hiding the differences.
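
Michael is talking about OLE DB/ADO, but the same idea can be sketched in
JDBC terms (the table and column below are invented): the application only
ever handles java.lang.String, which is UTF-16, and the driver performs
whatever conversion the N-typed column needs.

  import java.sql.*;

  public class NcharRoundTrip {
      // Hypothetical table: CREATE TABLE notes (id INT, body NVARCHAR(400))
      static String roundTrip(Connection con, int id, String text) throws SQLException {
          PreparedStatement ins =
              con.prepareStatement("INSERT INTO notes (id, body) VALUES (?, ?)");
          ins.setInt(1, id);
          ins.setString(2, text);      // driver converts from the Java String as needed
          ins.executeUpdate();
          ins.close();

          PreparedStatement sel = con.prepareStatement("SELECT body FROM notes WHERE id = ?");
          sel.setInt(1, id);
          ResultSet rs = sel.executeQuery();
          rs.next();
          String back = rs.getString(1);   // comes back as a UTF-16 Java String
          rs.close();
          sel.close();
          return back;
      }
  }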

Michael

 --
 From: [EMAIL PROTECTED][SMTP:[EMAIL PROTECTED]]
 Sent: Friday, June 23, 2000 2:27 PM
 To:   Michael Kaplan (Trigeminal Inc.)
 Cc:   Unicode List; [EMAIL PROTECTED]
 Subject:  RE: Java, SQL, Unicode and Databases
 
 
 
 Michael, are you saying that the data type (char or nchar) doesn't matter?
 Are you saying that if we just use UTF-16 or wchar_t interfaces to access the
 data all will be fine and we will be able to store multilingual data even in
 fields defined as char? Maybe things aren't as bad as I feared.

 With respect to the web applications you describe, do they store the UTF-8 as
 binary data? This wouldn't work for us, since we want other data mining
 applications to be able to access the same data.
 
 Thanks,
 Joe
 
 "Michael Kaplan (Trigeminal Inc.)" [EMAIL PROTECTED] on 06/23/2000
 10:41:39 AM
 
 To:   Unicode List [EMAIL PROTECTED], Joe Ross/Tivoli Systems@Tivoli
 Systems
 cc:   Hossein Kushki@IBMCA
 Subject:  RE: Java, SQL, Unicode and Databases
 
 
 
 



Java, SQL, Unicode and Databases

2000-06-22 Thread Tex Texin

I want to write an application in Java that will store information
in a database using Unicode. Ideally the application will run
with any database that supports Unicode. One would presume that the
JDBC driver would take care of any differences between databases
so my application could be independent of database.
(OK, I know it is a naive view.)

However, I am hearing that databases from different vendors require
use of different datatypes or limit you to using certain datatypes
if you want to store Unicode. Changing datatypes would, I presume, make
a significant difference in my programming of the application...

So, I want to make a list of the changes I need to make to 
my Java, SQL application in the event I want to
support each of the major databases (Oracle 8i, MS SQL Server 7,
etc.) with respect to Unicode data storage.

(I am sure there are other differences programming to different
databases, independent of Unicode data, but those issues are
understood.)

So, if you can help me by identifying specific changes you would make
to query or update a major vendor's database with respect to Unicode
support, I would be very appreciative. If I get a good list, I'll
post it back here. I am most interested in Oracle and MS SQL Server,
but will collect info on any database.

As an example, I am hearing that some databases would require varchar,
others nchar, for Unicode data.
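
One concrete example of the kind of per-database change being asked about,
hedged and with an invented table and sample data: SQL Server wants the N
prefix on Unicode string literals, while a database whose character set is
UTF-8 takes a plain literal; a PreparedStatement parameter sidesteps the
literal-syntax difference entirely.

  public class LiteralDifference {
      public static void main(String[] args) {
          String name = "Stra\u00DFe";   // sample non-ASCII data
          // SQL Server 7: the N prefix keeps the literal as Unicode (nchar) data.
          String sqlServer = "INSERT INTO products (name) VALUES (N'" + name + "')";
          // Oracle/Sybase/DB2 with a UTF-8 database character set: a plain literal.
          String utf8Db = "INSERT INTO products (name) VALUES ('" + name + "')";
          System.out.println(sqlServer);
          System.out.println(utf8Db);
          // With "INSERT INTO products (name) VALUES (?)" and setString, the same
          // statement text works everywhere; only the schema differs per vendor.
      }
  }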

tex


-- 

Tex Texin Director, International Products
 
Progress Software Corp.   +1-781-280-4271
14 Oak Park   +1-781-280-4655 (Fax)
Bedford, MA 01730  USA[EMAIL PROTECTED]

http://www.progress.com   The #1 Embedded Database
http://www.SonicMQ.comJMS Compliant Messaging- Best Middleware
Award
http://www.aspconnections.com Leading provider in the ASP marketplace

Progress Globalization Program (New URL)
http://www.progress.com/partners/globalization.htm

Come to the Panel on Open Source Approaches to Unicode Libraries at
the Sept. Unicode Conference
http://www.unicode.org/iuc/iuc17