Re: [totally OT] Unicode terminology (was Re: string vs. char [was Re: Java and Unicode])
If the difference between "A" and "a" is called "case", what is the difference between HIRAGANA LETTER YA and KATAKANA LETTER YA called? (I think either of those letters would do to describe this with the new code pages. The description would be enhanced by liberal application of HIRAGANA-KATAKANA LONG VOWEL MARK.) I like "Astral Planes" better. Will they include INUKTITUT VIGESIMAL DIGITs? I should have voted Sarasvati for US President. Instead I voted for Saotome Nodoka. Marco Cimarosti <[EMAIL PROTECTED]> wrote: > David Starner wrote: > > Sent: 20 Nov 2000, Mon 16.18 > > To: Unicode List > > Subject: Re: string vs. char [was Re: Java and > Unicode] > > > > On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael > (michka) > > Kaplan wrote: > > > From: "Marco Cimarosti" <[EMAIL PROTECTED]> > > > > > > > the Surrogate (aka "Astral") Planes. > > > > > > I believe the UTC has deprecated the term Astral > planes with extreme > > > prejudice. HTH! > > > > The UTC has chosen not to use the term Astral Plane. > Keeping > > that in mind, > > I can choose to use whatever terms I want, realizing > of course > > that some > > may not get my point across. The UTC chose Surrogate > Planes > > for perceived > > functionality and translatability; I chose Astral > Planes for > > perceived grace and beauty. > > Well, I am not as angrily pro "Astral Planes" as > David is, but I too find > the humorous term prettier than the official one. > And I used it because I > think that a few people on this list may still > find it clearer than the > official "Surrogate Planes" -- which is more serious > and descriptive, but > still relatively new to many. > > Moreover, although my attitude towards the UTC I thought UTC meant Universal Coordinated Time, like this: UTC 2000a11l22d13h02m. > (the "government" of Unicode) > is much more friendly than my attitude towards > real governments out there > (if people like J. Jenkins or M. Davis were the > President of the USA this > would be a much nicer world!), still I don't feel > quite like obeying any > government's orders, prohibitions or deprecations > without opposing the due > resistance. > > 8-) (<-- smiley wearing anti-tear-gas glasses) > > _ Marco > > __ > La mia e-mail è ora: My e-mail is now: > >>> marco.cimarostiªeurope.com <<< > (Cambiare "ª" in "@") (Change "ª" to "@")
Re: string vs. char [was Re: Java and Unicode]
On Mon, Nov 20, 2000 at 09:36:08AM -0800, Mark Davis wrote: > The UTC will be using the terms "supplementary code points", "supplementary > characters" and "supplementary planes". The term it is "deprecating with > extreme prejudice" is "surrogate characters". > > See http://www.unicode.org/glossary/ for more information. That's good, as it is more consistent with IS 10646. 10646 does not use the term "surrogate". Keld
Re: string vs. char [was Re: Java and Unicode]
Please keep in mind my side point: > Antoine Leca wrote: > > > Please note that I left aside UTF-16, because I am not clear > > if 16-bit are adequate or not to code UTF-16 in wchar_t (in other words, if > > wchar_t can be a multiwide encoding). Marco Cimarosti wrote [with minor editing to keep it to the point]: > > > > wchar_t * _wcschr_32(const wchar_t * s, wint_t c); > > > wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c); > > > size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs); > > > > What is the point? > > But if «c >= 0x10000», then the character would be represented in «s» (an > UTF-16 string) by a surrogate pair, and the function would thus return the > address of the *high surrogate*. > > E.g., assuming that «s» is «{0x2190, 0xD800, 0xDC05, 0x2192, 0x0000}» and > «c» is 0x10005, both functions would return «&s[1]»: the address of the high > surrogate 0xD800. As I said, I am unsure UTF-16 is legal for wchar_t. If it is, I will agree with you. But the main point of the people that say "UTF-16 is illegal for wchar_t" is just this one: there are some cases that are not handled nicely by the current API. Antoine
Re: string vs. char [was Re: Java and Unicode]
The UTC will be using the terms "supplementary code points", "supplementary characters" and "supplementary planes". The term it is "deprecating with extreme prejudice" is "surrogate characters". See http://www.unicode.org/glossary/ for more information. Mark - Original Message - From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]> To: "Unicode List" <[EMAIL PROTECTED]> Sent: Monday, November 20, 2000 06:54 Subject: Re: string vs. char [was Re: Java and Unicode] > From: "Marco Cimarosti" <[EMAIL PROTECTED]> > > > the Surrograte (aka "Astral") Planes. > > I believe the UTC has deprecated the term Astral planes with extreme > prejudice. HTH! > > michka > > a new book on internationalization in VB at > http://www.i18nWithVB.com/ > > >
Re: string vs. char [was Re: Java and Unicode]
Hi Jani, I dunno. I oversimplified in that statement about exposing vs. hiding. ICU "hides" the facts about the Unicode implementation in macros, specifically a next and previous character macro and various other fillips. If you look very closely at the function (method) prototypes you can see that, in fact, a "character" is a 32-bit entity and a string is made (conditionally) of 16-bit entities. But, as you suggest, ICU makes it easy to work with (and is set up so that a sufficiently motivated coder could change the internal encoding). If you ask a 100 programmers the index of the string, they'll give you the wrong answer 99 times... because there is little or no I18n training in the course of becoming a programmer. The members of this list are continually ground down by the sheer inertia of ignorance (I just gave up answering one about email... I must have written a response to that message a bunch of times, but don't have the time or stamina this morning to go find and rework one of them). In any case this has been a fun and instructive interlude. As I said in my initial email, I tend to be a CONSUMER of Unicode APIs rather than a creator. I haven't written a Unicode support package in quite some time (and the last one was a UTF-8 hack in C++). It's good to be familiar with the details, but I find that, as a programmer one typically doesn't fully comprehend the design decisions until one faces them oneself. As it is, I ended up changing my design and sample code over the weekend to follow the suggestions of several on this list who've Been There. As a side note: one of the problems I faced on this project was the need to keep the Unicode and locale libraries extremely small (this is an embedded OS). I would happily have borrowed ICU to actually *be* the library... but it's too large. I've had to design a tiny (and therefore quite limited) support library. It's been an interesting experience. Best Regards, Addison === Addison P. PhillipsPrincipal Consultant Inter-Locale LLChttp://www.inter-locale.com Los Gatos, CA, USA mailto:[EMAIL PROTECTED] +1 408.210.3569 (mobile) +1 408.904.4762 (fax) === Globalization Engineering & Consulting Services On Mon, 20 Nov 2000, Jani Kajala wrote: > > > >The question, I guess, boils down to: put it in the interface, or hide it > > >in the internals. ICU exposes it. My spec, up to this point, hides it, > > (I'm aware that the original question was about C interfaces so you might consider >this a bit out of topic but I just wanted to comment about the exposed encoding) > > I think that exposing encoding in interfaces doesn't do any good. It violates >oriented design principles and it is not even intuitive. > > I'd bet that if we take 100 programmers and ask them 'What is this index in context >of this string?' in every case we'll get an answer that its of course the nth >character position. Nobody who isn't well aware of character encoding will ever think >of code units. Thus, it is not intuitive to use indices to point at code units. >Especially as Unicode has been so well-marketed as '16-bit character set'. > > Besides, you can always use (C++ style) iterators instead of plain indices without >any loss in performance or in syntactic convenience. With an 'iterator' in this I >refer to simple encapsulated pointer which behaves just as any C++ Standard Template >Library random access iterator but takes encoding into account. 
Example: > > for ( String::Iterator i = s.begin() ; i != s.end() ; ++i ) > // ith character in s = *i > // i+nth character in s = i[n] > > The solution works with any encoding as long as string::iterator is defined properly. > > The conclusion that using indices won't make a difference in performance also makes >sense if you consider the basic underlying task: If you need random access to a >string you need to check for characters spanning over multiple code units. So the >task is the same O(n) complexity, using indices won't help a bit. If the user needs >the access to arbitrary character he needs to iterate anyway. It is just matter how >you want to encapsulate the task. > > > Regards, > Jani Kajala > http://www.helsinki.fi/~kajala/ > >
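Jani's complexity point is easy to make concrete. The following plain-C sketch (illustrative only: the names are invented, uint16_t stands in for the 16-bit code unit type, and the buffer is assumed to be well-formed, 0-terminated UTF-16) shows why finding the n-th character is an O(n) walk no matter how the interface is dressed up:

    #include <stddef.h>
    #include <stdint.h>

    /* Return a pointer to the n-th character (code point) of a
       0-terminated UTF-16 buffer, or NULL if the string is shorter.
       A supplementary character occupies two code units, so the only
       way to find the n-th character is to walk from the start. */
    static const uint16_t *nth_char(const uint16_t *s, size_t n)
    {
        while (*s != 0) {
            if (n == 0)
                return s;
            if (s[0] >= 0xD800 && s[0] <= 0xDBFF &&
                s[1] >= 0xDC00 && s[1] <= 0xDFFF)
                s += 2;        /* surrogate pair: one character, two units */
            else
                s += 1;
            --n;
        }
        return NULL;           /* fewer than n+1 characters */
    }

Calling this in a loop over i would be O(n^2); an iterator that keeps its position (Jani's String::Iterator, or simply a pointer advanced in place) keeps the traversal linear, which is exactly the argument made above.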
[totally OT] Unicode terminology (was Re: string vs. char [was Re: Java and Unicode])
David Starner wrote: > Sent: 20 Nov 2000, Mon 16.18 > To: Unicode List > Subject: Re: string vs. char [was Re: Java and Unicode] > > On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka) > Kaplan wrote: > > From: "Marco Cimarosti" <[EMAIL PROTECTED]> > > > > > the Surrogate (aka "Astral") Planes. > > > > I believe the UTC has deprecated the term Astral planes with extreme > > prejudice. HTH! > > The UTC has chosen not to use the term Astral Plane. Keeping > that in mind, > I can choose to use whatever terms I want, realizing of course > that some > may not get my point across. The UTC chose Surrogate Planes > for perceived > functionality and translatability; I chose Astral Planes for > perceived grace and beauty. Well, I am not as angrily pro "Astral Planes" as David is, but I too find the humorous term prettier than the official one. And I used it because I think that a few people on this list may still find it clearer than the official "Surrogate Planes" -- which is more serious and descriptive, but still relatively new to many. Moreover, although my attitude towards the UTC (the "government" of Unicode) is much more friendly than my attitude towards real governments out there (if people like J. Jenkins or M. Davis were the President of the USA this would be a much nicer world!), still I don't feel quite like obeying any government's orders, prohibitions or deprecations without opposing the due resistance. 8-) (<-- smiley wearing anti-tear-gas glasses) _ Marco __ La mia e-mail è ora: My e-mail is now: >>> marco.cimarostiªeurope.com <<< (Cambiare "ª" in "@") (Change "ª" to "@")
Re: string vs. char [was Re: Java and Unicode]
David Starner wrote: > I chose Astral Planes for perceived grace > and beauty. Thank you! -- There is / one art || John Cowan <[EMAIL PROTECTED]> no more / no less|| http://www.reutershealth.com to do / all things || http://www.ccil.org/~cowan with art- / lessness \\ -- Piet Hein
Re: string vs. char [was Re: Java and Unicode]
I think the issue is more one of the semantic meaning that terms like astral, imaginary, irrational, or other such terms bring to the table. Refusing to potentially insult the people who place importance on the characters that will be encoded on planes other than the BMP is a thing of grace and beauty (much more so than the insult would be!). I think the UTC action is a responsible one. (Just my two cents) michka a new book on internationalization in VB at http://www.i18nWithVB.com/ - Original Message - From: "David Starner" <[EMAIL PROTECTED]> To: "Unicode List" <[EMAIL PROTECTED]> Sent: Monday, November 20, 2000 7:18 AM Subject: Re: string vs. char [was Re: Java and Unicode] > On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka) Kaplan wrote: > > From: "Marco Cimarosti" <[EMAIL PROTECTED]> > > > > > the Surrogate (aka "Astral") Planes. > > > > I believe the UTC has deprecated the term Astral planes with extreme > > prejudice. HTH! > > The UTC has chosen not to use the term Astral Plane. Keeping that in mind, > I can choose to use whatever terms I want, realizing of course that some > may not get my point across. The UTC chose Surrogate Planes for perceived > functionality and translatability; I chose Astral Planes for perceived grace > and beauty. > > -- > David Starner - [EMAIL PROTECTED] > http://dvdeug.dhis.org > Looking for a Debian developer in the Stillwater, Oklahoma area > to sign my GPG key >
Re: string vs. char [was Re: Java and Unicode]
On Mon, Nov 20, 2000 at 06:54:27AM -0800, Michael (michka) Kaplan wrote: > From: "Marco Cimarosti" <[EMAIL PROTECTED]> > > > the Surrogate (aka "Astral") Planes. > > I believe the UTC has deprecated the term Astral planes with extreme > prejudice. HTH! The UTC has chosen not to use the term Astral Plane. Keeping that in mind, I can choose to use whatever terms I want, realizing of course that some may not get my point across. The UTC chose Surrogate Planes for perceived functionality and translatability; I chose Astral Planes for perceived grace and beauty. -- David Starner - [EMAIL PROTECTED] http://dvdeug.dhis.org Looking for a Debian developer in the Stillwater, Oklahoma area to sign my GPG key
Re: string vs. char [was Re: Java and Unicode]
From: "Marco Cimarosti" <[EMAIL PROTECTED]> > the Surrograte (aka "Astral") Planes. I believe the UTC has deprecated the term Astral planes with extreme prejudice. HTH! michka a new book on internationalization in VB at http://www.i18nWithVB.com/
Re: string vs. char [was Re: Java and Unicode]
Antoine Leca wrote: > Marco Cimarosti wrote: > > Actually, C does have different types for characters within > strings and for > > characters in isolation. > > That is not my point of view. > There is a special case for 'H', that holds int type rather > than char, for > backward compatibility reasons (such as because the first > versions of C were > not able to deal correctly with to-be-promoted arguments). > Similarly, a > number of (old) functions use int for the character arguments. > Then, there is the point of view that int represents _either_ a valid > character, _or_ an error indication (EOF). This is the reason > that makes > int used for the return type of fgetc. OK. > Outside this, a string is clearly an array of characters, and > characters are > stored using the type char (or one of the sign alternatives). > As a result, > you can write 'H' either as such, or as "Hello, world!\n"[0]. OK. > > The type of a string literal (e.g. "Hello world!\n") is > "array of char", > > while the type of a character literal (e.g. 'H') is "int". > > > > This distinction is generally reflected also in the C > library, so that you > > don't get compiler warnings when passing character > constants to functions. > > You need not, since C considers character to be (small) > integers, which eases > passing of arguments. This is unrelated to the issue. OK. I was just describing the background. > > This distinction has been retained also in the newer "wide character > > library": "wchar_t" is the wide equivalent of "char", while > "wint_t" is the > > wide equivalent of "int". > > Not exactly. The wide versions has the same distinction as > the narrow one for > the second case above (finding errors), but not for the first > one (promoting). OK. > > The wide version of the examples above is: > > > > int fputws(const wchar_t * c, FILE * stream); > > wint_t fputwc(wint_t c, FILE * stream); > ^^ > Instead, we have > int fputws(const wchar_t * s, FILE * stream); > wint_t fputwc(wchar_t c, FILE *stream); > > It shows clearly that c cannot hold the WEOF value. OTOH, the > returned value > _can_ be the error indication WEOF, so the type is wint_t. Oops! Sorry. I had two versions of "wchar.h" at hand: one (lovingly crafted by myself) had «wchar_t c»; the other one (shipped with Microsoft Visual C++ 6.0) had: _CRTIMP wint_t __cdecl fputwc(wint_t, FILE *); I did the mistake to trust the second one. :-) > Similarly, the type of L'H' is wchar_t. You gave other > examples in your "But". > > > int iswalpha(wint_t c); > > Here, the iswalpha is intended to be able to test valid > characters as well > as the error indication, so the type is wint_t; here WEOF is > specifically allowed. OK. > > In an Unicode implementation of the "wide character > library" (wchar.h and > > wctype.h), this difference may be exploited to use > different UTF's for > > strings and characters: > > Ah, now we go into the interresting field. > Please note that I left aside UTF-16, because I am not clear > if 16-bit are > adequate or not to code UTF-16 in wchar_t (in other words, if > wchar_t can be > a multiwide encoding). > > > typedef unsigned short wchar_t; > > /* UTF-16 character, used within string. */ > > > > typedef unsigned long wint_t; > > /* UTF-32 character, used for handling isolated characters. */ > > To date, no problem. > > > But, unluckily, there is a "but". Type "wchar_t" is *also* > used for isolated > > character in a couple of stupid APIs: > > See above for another example: fputwc... 
> > > But I think that changing those "wchar_t c" to "wint_t c" > is a smaller > > "violence" to the standards than changing them to "const > wchar_t * c". > > ;-) OK, my trick is dirty as well, just a bit easier to hide. ;-) > > And you can also implement it in an elegant, quasi-standard way: > > > wchar_t * _wcschr_32(const wchar_t * s, wint_t c); > > wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c); > > size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs); > > What is the point? You cannot pass to these anything other than values > between 0 (WCHAR_MIN) and WCHAR_MAX anyway. And there are no really > "interesting" ways to extend the meaning of these functions outside > this range. > Or do I miss something? _wcschr_32 and _wcsrchr_32 would return a pointer to the first (or last) occurrence of the specified character in the string, just like their standard counterparts. But if «c >= 0x10000», then the character would be represented in «s» (an UTF-16 string) by a surrogate pair, and the function would thus return the address of the *high surrogate*. E.g., assuming that «s» is «{0x2190, 0xD800, 0xDC05, 0x2192, 0x0000}» and «c» is 0x10005, both functions would return «&s[1]»: the address of the high surrogate 0xD800. Similarly for _wcrtomb_32(): assuming that «s» points into an UTF-8 string, the function would insert in «s» the four-octet UTF-8 sequence corresponding to «c». > > #ifdef PEDANTIC_STANDARD > > wchar_t * wcschr(const wchar_t * s, wchar_t c); > > wchar_t * wcsrchr(const wchar_t * s, wchar_t c); > > size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs); > > #else > > #define wcschr _wcschr_32 > > #define wcsrchr _wcsrchr_32 > > #define wcrtomb _wcrtomb_32 > > #endif
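To spell out the behaviour Marco describes, here is a minimal sketch of a _wcschr_32-style search (not anyone's shipping code: uint16_t and uint32_t stand in for the proposal's 16-bit wchar_t and 32-bit wint_t, the string is assumed to be well-formed, 0-terminated UTF-16, and c is assumed to be a valid code point):

    #include <stddef.h>
    #include <stdint.h>

    /* Find the first occurrence of code point c in a 0-terminated UTF-16
       string.  For c >= 0x10000 the match is a surrogate pair and the
       returned pointer addresses its high (lead) surrogate. */
    static const uint16_t *wcschr_32(const uint16_t *s, uint32_t c)
    {
        uint16_t lead, trail;

        if (c < 0x10000) {
            for (; *s != 0; ++s)
                if (*s == c)
                    return s;
            return (c == 0) ? s : NULL;   /* like wcschr, c may name the terminator */
        }
        lead  = (uint16_t)(0xD800 + ((c - 0x10000) >> 10));
        trail = (uint16_t)(0xDC00 + ((c - 0x10000) & 0x3FF));
        for (; *s != 0; ++s)
            if (s[0] == lead && s[1] == trail)
                return s;                 /* address of the high surrogate */
        return NULL;
    }

With the example string {0x2190, 0xD800, 0xDC05, 0x2192, 0x0000} and c = 0x10005 this returns &s[1], the address of the high surrogate, as described above.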
Re: string vs. char [was Re: Java and Unicode]
Marco Cimarosti wrote: > > Actually, C does have different types for characters within strings and for > characters in isolation. That is not my point of view. There is a special case for 'H', that holds int type rather than char, for backward compatibility reasons (such as because the first versions of C were not able to deal correctly with to-be-promoted arguments). Similarly, a number of (old) functions use int for the character arguments. Then, there is the point of view that int represents _either_ a valid character, _or_ an error indication (EOF). This is the reason that makes int used for the return type of fgetc. Outside this, a string is clearly an array of characters, and characters are stored using the type char (or one of the sign alternatives). As a result, you can write 'H' either as such, or as "Hello, world!\n"[0]. > The type of a string literal (e.g. "Hello world!\n") is "array of char", > while the type of a character literal (e.g. 'H') is "int". > > This distinction is generally reflected also in the C library, so that you > don't get compiler warnings when passing character constants to functions. You need not, since C considers character to be (small) integers, which eases passing of arguments. This is unrelated to the issue. > This distinction has been retained also in the newer "wide character > library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the > wide equivalent of "int". Not exactly. The wide versions has the same distinction as the narrow one for the second case above (finding errors), but not for the first one (promoting). > The wide version of the examples above is: > > int fputws(const wchar_t * c, FILE * stream); > wint_t fputwc(wint_t c, FILE * stream); ^^ Instead, we have int fputws(const wchar_t * s, FILE * stream); wint_t fputwc(wchar_t c, FILE *stream); It shows clearly that c cannot hold the WEOF value. OTOH, the returned value _can_ be the error indication WEOF, so the type is wint_t. Similarly, the type of L'H' is wchar_t. You gave other examples in your "But". > int iswalpha(wint_t c); Here, the iswalpha is intended to be able to test valid characters as well as the error indication, so the type is wint_t; here WEOF is specifically allowed. > In an Unicode implementation of the "wide character library" (wchar.h and > wctype.h), this difference may be exploited to use different UTF's for > strings and characters: Ah, now we go into the interresting field. Please note that I left aside UTF-16, because I am not clear if 16-bit are adequate or not to code UTF-16 in wchar_t (in other words, if wchar_t can be a multiwide encoding). > typedef unsigned short wchar_t; > /* UTF-16 character, used within string. */ > > typedef unsigned long wint_t; > /* UTF-32 character, used for handling isolated characters. */ To date, no problem. > But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated > character in a couple of stupid APIs: See above for another example: fputwc... > But I think that changing those "wchar_t c" to "wint_t c" is a smaller > "violence" to the standards than changing them to "const wchar_t * c". ;-) > And you can also implement it in an elegant, quasi-standard way: > wchar_t * _wcschr_32(const wchar_t * s, wint_t c); > wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c); > size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs); What is the point? You cannot pass to these anything other than values between 0 (WCHAR_MIN) and WCHAR_MAX anyway. 
And there are no really "interesting" ways to extend the meaning of these functions outside this range. Or do I miss something? > #ifdef PEDANTIC_STANDARD > wchar_t * wcschr(const wchar_t * s, wchar_t c); > wchar_t * wcsrchr(const wchar_t * s, wchar_t c); > size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs); > #else > #define wcschr _wcschr_32 > #define wcsrchr _wcsrchr_32 > #define wcrtomb _wcrtomb_32 > #endif
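Antoine's remark that int (and wint_t) must hold "either a valid character or an error indication" is the classic fgetc rule; a small illustration in standard C, nothing project-specific:

    #include <stdio.h>

    /* Count the bytes of a stream.  fgetc returns int so that every byte
       value 0..255 stays distinguishable from EOF; storing the result in
       a plain char would fold some byte value onto the end-of-file code. */
    static long count_bytes(FILE *f)
    {
        long n = 0;
        int c;                        /* int, not char: must also hold EOF */
        while ((c = fgetc(f)) != EOF)
            ++n;
        return n;
    }

The wide functions repeat the pattern: fgetwc returns wint_t so that WEOF has somewhere to live outside the set of valid wchar_t values, and iswalpha takes wint_t for the same reason, while fputwc can take a plain wchar_t because an error indication never flows into it.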
RE: string vs. char [was Re: Java and Unicode]
Well... I think you're right. I knew that char and string units weren't really the same thing. My concern was how to make it easy on developers to use the Unicode API using their "native intelligence". More thought makes me less certain of my approach. Specifically, as Mark points out, looping structures are much more ugly when using pointers. And I still have to have all of the code for the scalar value conversion. It's better to force a casting macro on the developers than trying to do their dirty work for them. Fooling developers who use your API is a Bad Idea, usually. Thanks again for the feedback. Addison === Addison P. PhillipsPrincipal Consultant Inter-Locale LLChttp://www.inter-locale.com Los Gatos, CA, USA mailto:[EMAIL PROTECTED] +1 408.210.3569 (mobile) +1 408.904.4762 (fax) === Globalization Engineering & Consulting Services On Fri, 17 Nov 2000, Marco Cimarosti wrote: > Addison P. Phillips wrote: > > I ended up deciding that the Unicode API for this OS will only work in > > strings. CTYPE replacement functions (such as isalpha) and > > character based > > replacement functions (such as strchr) will take and return > > strings for > > all of their arguments. > > > > Internally, my functions are converting the pointed character to its > > scalar value (to look it up in the database most efficiently). > > > > This isn't very satisfying. It goes somewhat against the grain of 'C' > > programming. But it's equally unsatisfying to use a 32-bit > > representation > > for a character and a 16-bit representation for a string, > > because in 'C', > > a string *is* an array of characters. Which is more > > natural? Which is more common? Iterating across an array of > > 16-bit values > > or > > Actually, C does have different types for characters within strings and for > characters in isolation. > > The type of a string literal (e.g. "Hello world!\n") is "array of char", > while the type of a character literal (e.g. 'H') is "int". > > This distinction is generally reflected also in the C library, so that you > don't get compiler warnings when passing character constants to functions. > > E.g., compare the following functions from : > > int fputs(const char * s, FILE * stream); > int fputc(int c, FILE * stream); > > The same convention is generally used through the C library, not only in the > I/O functions. E.g.: > > int isalpha(int c); > int tolower(int c); > > This distinction has been retained also in the newer "wide character > library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the > wide equivalent of "int". > > The wide version of the examples above is: > > int fputws(const wchar_t * c, FILE * stream); > wint_t fputwc(wint_t c, FILE * stream); > > int iswalpha(wint_t c); > wint_t towlower(wint_t c); > > In an Unicode implementation of the "wide character library" (wchar.h and > wctype.h), this difference may be exploited to use different UTF's for > strings and characters: > > typedef unsigned short wchar_t; > /* UTF-16 character, used within string. */ > > typedef unsigned long wint_t; > /* UTF-32 character, used for handling isolated characters. */ > > But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated > character in a couple of stupid APIs: > > wchar_t * wcschr(const wchar_t * s, wchar_t c); > wchar_t * wcsrchr(const wchar_t * s, wchar_t c); > size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs); > > BTW, the blunder in wcschr() and wcsrchr() is inherited from their "narrow" > ancestors: strchr() and strrchr(). 
> > But I think that changing those "wchar_t c" to "wint_t c" is a smaller > "violence" to the standards than changing them to "const wchar_t * c". > And you can also implement it in an elegant, quasi-standard way: > > wchar_t * _wcschr_32(const wint_t * s, wchar_t c); > wchar_t * _wcsrchr_32(const wint_t * s, wchar_t c); > size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs); > > #ifdef PEDANTIC_STANDARD > wchar_t * wcschr(const wchar_t * s, wchar_t c); > wchar_t * wcsrchr(const wchar_t * s, wchar_t c); > size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs); > #else > #define wcschr _wcschr_32 > #define wcsrchr _wcsrchr_32 > #define wcrtomb _wcrtomb_32 > #endif > > I would like to see the opinion of C standardization experts (e.g. A. Leca) > about this forcing of the C standard. > > _ Marco. > > __ > La mia e-mail è ora: My e-mail is now: > >>> marco.cimarostiªeurope.com <<< > (Cambiare "ª" in "@") (Change "ª" to "@") > > > __ > FREE Personalized Email at Mail.com > Sign up at http://www.mail.com/?sr=signup >
Re: string vs. char [was Re: Java and Unicode]
Thanks Mark. I've looked extensively at the ICU code in doing much of the design on this system. What my email didn't end up saying was, basically, that the "char" functions end up decoding a scalar value internally in a 32-bit integer value. The question, I guess, boils down to: put it in the interface, or hide it in the internals. ICU exposes it. My spec, up to this point, hides it, because I think that programmers will be working with strings more often than with individual characters and that perhaps this will seem more "natural". Addison === Addison P. PhillipsPrincipal Consultant Inter-Locale LLChttp://www.inter-locale.com Los Gatos, CA, USA mailto:[EMAIL PROTECTED] +1 408.210.3569 (mobile) +1 408.904.4762 (fax) === Globalization Engineering & Consulting Services On Thu, 16 Nov 2000, Mark Davis wrote: > We have found that it works pretty well to have a uchar32 datatype, with > uchar16 storage in strings. In ICU (C version) we use macros for efficient > access; in ICU (C++) version we use method calls, and for ICU (Java version) > we have a set of utility static methods (since we can't add to the Java > String API). > > With these functions, the number of changes that you have to make to > existing code is fairly small, and you don't have to change the way that > loops are set up, for example. > > Mark > > - Original Message - > From: <[EMAIL PROTECTED]> > To: "Unicode List" <[EMAIL PROTECTED]> > Sent: Thursday, November 16, 2000 13:24 > Subject: string vs. char [was Re: Java and Unicode] > > > > Normally this thread would be of only academic interest to me... > > > > ...but this week I'm writing a spec for adding Unicode support to an > > embedded operating system written in C. Due to Mssrs. O'Conner and > > Scherer's presentations at the most recent IUC, I was aware of the clash > > between internal string representations and the Unicode Scalar Value > > necessary for efficient lookup. > > > > Now I'm getting alarmed about the solution I've selected. > > > > The OS I'm working on is written in C. I considered, therefore, using > > UTF-8 as the internal Unicode representation (because I don't have the > > option of #defining Unicode and using wchar), but the storage expansion > > and the fact that several existing modules grok UTF-16 (well, UCS-2), led > > me to go in the direction of UTF-16. > > > > I also considered supporting only UCS-2. It's a bad bad bad idea, but it > > gets me out of the following: > > > > I ended up deciding that the Unicode API for this OS will only work in > > strings. CTYPE replacement functions (such as isalpha) and character based > > replacement functions (such as strchr) will take and return strings for > > all of their arguments. > > > > Internally, my functions are converting the pointed character to its > > scalar value (to look it up in the database most efficiently). > > > > This isn't very satisfying. It goes somewhat against the grain of 'C' > > programming. But it's equally unsatisfying to use a 32-bit representation > > for a character and a 16-bit representation for a string, because in 'C', > > a string *is* an array of characters. Which is more > > natural? Which is more common? Iterating across an array of 16-bit values > > or > > > > === > > Addison P. PhillipsPrincipal Consultant > > Inter-Locale LLChttp://www.inter-locale.com > > Los Gatos, CA, USA mailto:[EMAIL PROTECTED] > > > > +1 408.210.3569 (mobile) +1 408.904.4762 (fax) > > === > > Globalization Engineering & Consulting Services > > > > > > > >
RE: string vs. char [was Re: Java and Unicode]
Ooops! In my previous message, I wrote: > wchar_t * _wcschr_32(const wint_t * s, wchar_t c); > wchar_t * _wcsrchr_32(const wint_t * s, wchar_t c); What I actually wanted to write is: wchar_t * _wcschr_32(const wchar_t * s, wint_t c); wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c); Sorry if this puzzled you. _ Marco __ La mia e-mail è ora: My e-mail is now: >>> marco.cimarostiªeurope.com <<< (Cambiare "ª" in "@") (Change "ª" to "@") __ FREE Personalized Email at Mail.com Sign up at http://www.mail.com/?sr=signup
RE: string vs. char [was Re: Java and Unicode]
Addison P. Phillips wrote: > I ended up deciding that the Unicode API for this OS will only work in > strings. CTYPE replacement functions (such as isalpha) and > character based > replacement functions (such as strchr) will take and return > strings for > all of their arguments. > > Internally, my functions are converting the pointed character to its > scalar value (to look it up in the database most efficiently). > > This isn't very satisfying. It goes somewhat against the grain of 'C' > programming. But it's equally unsatisfying to use a 32-bit > representation > for a character and a 16-bit representation for a string, > because in 'C', > a string *is* an array of characters. Which is more > natural? Which is more common? Iterating across an array of > 16-bit values > or Actually, C does have different types for characters within strings and for characters in isolation. The type of a string literal (e.g. "Hello world!\n") is "array of char", while the type of a character literal (e.g. 'H') is "int". This distinction is generally reflected also in the C library, so that you don't get compiler warnings when passing character constants to functions. E.g., compare the following functions from <stdio.h>: int fputs(const char * s, FILE * stream); int fputc(int c, FILE * stream); The same convention is generally used through the C library, not only in the I/O functions. E.g.: int isalpha(int c); int tolower(int c); This distinction has been retained also in the newer "wide character library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the wide equivalent of "int". The wide version of the examples above is: int fputws(const wchar_t * c, FILE * stream); wint_t fputwc(wint_t c, FILE * stream); int iswalpha(wint_t c); wint_t towlower(wint_t c); In an Unicode implementation of the "wide character library" (wchar.h and wctype.h), this difference may be exploited to use different UTF's for strings and characters: typedef unsigned short wchar_t; /* UTF-16 character, used within string. */ typedef unsigned long wint_t; /* UTF-32 character, used for handling isolated characters. */ But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated character in a couple of stupid APIs: wchar_t * wcschr(const wchar_t * s, wchar_t c); wchar_t * wcsrchr(const wchar_t * s, wchar_t c); size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs); BTW, the blunder in wcschr() and wcsrchr() is inherited from their "narrow" ancestors: strchr() and strrchr(). But I think that changing those "wchar_t c" to "wint_t c" is a smaller "violence" to the standards than changing them to "const wchar_t * c". And you can also implement it in an elegant, quasi-standard way: wchar_t * _wcschr_32(const wint_t * s, wchar_t c); wchar_t * _wcsrchr_32(const wint_t * s, wchar_t c); size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs); #ifdef PEDANTIC_STANDARD wchar_t * wcschr(const wchar_t * s, wchar_t c); wchar_t * wcsrchr(const wchar_t * s, wchar_t c); size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs); #else #define wcschr _wcschr_32 #define wcsrchr _wcsrchr_32 #define wcrtomb _wcrtomb_32 #endif I would like to see the opinion of C standardization experts (e.g. A. Leca) about this forcing of the C standard. _ Marco. __ La mia e-mail è ora: My e-mail is now: >>> marco.cimarostiªeurope.com <<< (Cambiare "ª" in "@") (Change "ª" to "@")
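As a companion to the prototypes above, here is a minimal sketch of the conversion a _wcrtomb_32-style function would perform in a UTF-8 locale (illustrative only: no mbstate_t handling, the output buffer is assumed to have room for four bytes, and the names are invented):

    #include <stddef.h>
    #include <stdint.h>

    /* Encode one code point (the proposal's 32-bit wint_t) as UTF-8.
       Returns the number of bytes written, or 0 for an invalid value. */
    static size_t cp_to_utf8(uint32_t c, unsigned char *s)
    {
        if (c < 0x80) {
            s[0] = (unsigned char)c;
            return 1;
        }
        if (c < 0x800) {
            s[0] = (unsigned char)(0xC0 | (c >> 6));
            s[1] = (unsigned char)(0x80 | (c & 0x3F));
            return 2;
        }
        if (c < 0x10000) {
            if (c >= 0xD800 && c <= 0xDFFF)
                return 0;                 /* surrogate values are not characters */
            s[0] = (unsigned char)(0xE0 | (c >> 12));
            s[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            s[2] = (unsigned char)(0x80 | (c & 0x3F));
            return 3;
        }
        if (c <= 0x10FFFF) {
            s[0] = (unsigned char)(0xF0 | (c >> 18));
            s[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            s[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            s[3] = (unsigned char)(0x80 | (c & 0x3F));
            return 4;
        }
        return 0;
    }

The last branch is the one a 16-bit wchar_t parameter cannot even express, which is the whole reason the proposal moves the character argument to wint_t.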
Re: string vs. char [was Re: Java and Unicode]
We have found that it works pretty well to have a uchar32 datatype, with uchar16 storage in strings. In ICU (C version) we use macros for efficient access; in ICU (C++) version we use method calls, and for ICU (Java version) we have a set of utility static methods (since we can't add to the Java String API). With these functions, the number of changes that you have to make to existing code is fairly small, and you don't have to change the way that loops are set up, for example. Mark - Original Message - From: <[EMAIL PROTECTED]> To: "Unicode List" <[EMAIL PROTECTED]> Sent: Thursday, November 16, 2000 13:24 Subject: string vs. char [was Re: Java and Unicode] > Normally this thread would be of only academic interest to me... > > ...but this week I'm writing a spec for adding Unicode support to an > embedded operating system written in C. Due to Mssrs. O'Conner and > Scherer's presentations at the most recent IUC, I was aware of the clash > between internal string representations and the Unicode Scalar Value > necessary for efficient lookup. > > Now I'm getting alarmed about the solution I've selected. > > The OS I'm working on is written in C. I considered, therefore, using > UTF-8 as the internal Unicode representation (because I don't have the > option of #defining Unicode and using wchar), but the storage expansion > and the fact that several existing modules grok UTF-16 (well, UCS-2), led > me to go in the direction of UTF-16. > > I also considered supporting only UCS-2. It's a bad bad bad idea, but it > gets me out of the following: > > I ended up deciding that the Unicode API for this OS will only work in > strings. CTYPE replacement functions (such as isalpha) and character based > replacement functions (such as strchr) will take and return strings for > all of their arguments. > > Internally, my functions are converting the pointed character to its > scalar value (to look it up in the database most efficiently). > > This isn't very satisfying. It goes somewhat against the grain of 'C' > programming. But it's equally unsatisfying to use a 32-bit representation > for a character and a 16-bit representation for a string, because in 'C', > a string *is* an array of characters. Which is more > natural? Which is more common? Iterating across an array of 16-bit values > or > > === > Addison P. PhillipsPrincipal Consultant > Inter-Locale LLChttp://www.inter-locale.com > Los Gatos, CA, USA mailto:[EMAIL PROTECTED] > > +1 408.210.3569 (mobile) +1 408.904.4762 (fax) > === > Globalization Engineering & Consulting Services > > >
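The macro idea is easy to picture. The sketch below is only in the spirit of what Mark describes and is not ICU's actual macro (the names are invented and the buffer is assumed to be well-formed UTF-16), but it shows why loops keep their familiar shape:

    #include <stdint.h>

    /* Read the code point at s[i] into c (a 32-bit lvalue) and advance i past it. */
    #define NEXT_CP(s, i, c)                                          \
        do {                                                          \
            (c) = (s)[(i)++];                                         \
            if ((c) >= 0xD800 && (c) <= 0xDBFF &&                     \
                (s)[(i)] >= 0xDC00 && (s)[(i)] <= 0xDFFF)             \
                (c) = 0x10000 + (((c) - 0xD800) << 10)                \
                              + ((s)[(i)++] - 0xDC00);                \
        } while (0)

    /* Typical use: the loop itself still looks like a 16-bit loop. */
    static uint32_t sum_code_points(const uint16_t *buf, int32_t length)
    {
        uint32_t c, sum = 0;
        for (int32_t i = 0; i < length; ) {
            NEXT_CP(buf, i, c);
            sum += c;                 /* stand-in for real per-character work */
        }
        return sum;
    }

A real macro would also guard against an unpaired lead surrogate at the very end of the buffer; that check is omitted here to keep the shape visible.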
string vs. char [was Re: Java and Unicode]
Normally this thread would be of only academic interest to me... ...but this week I'm writing a spec for adding Unicode support to an embedded operating system written in C. Due to Mssrs. O'Conner and Scherer's presentations at the most recent IUC, I was aware of the clash between internal string representations and the Unicode Scalar Value necessary for efficient lookup. Now I'm getting alarmed about the solution I've selected. The OS I'm working on is written in C. I considered, therefore, using UTF-8 as the internal Unicode representation (because I don't have the option of #defining Unicode and using wchar), but the storage expansion and the fact that several existing modules grok UTF-16 (well, UCS-2), led me to go in the direction of UTF-16. I also considered supporting only UCS-2. It's a bad bad bad idea, but it gets me out of the following: I ended up deciding that the Unicode API for this OS will only work in strings. CTYPE replacement functions (such as isalpha) and character based replacement functions (such as strchr) will take and return strings for all of their arguments. Internally, my functions are converting the pointed character to its scalar value (to look it up in the database most efficiently). This isn't very satisfying. It goes somewhat against the grain of 'C' programming. But it's equally unsatisfying to use a 32-bit representation for a character and a 16-bit representation for a string, because in 'C', a string *is* an array of characters. Which is more natural? Which is more common? Iterating across an array of 16-bit values or === Addison P. PhillipsPrincipal Consultant Inter-Locale LLChttp://www.inter-locale.com Los Gatos, CA, USA mailto:[EMAIL PROTECTED] +1 408.210.3569 (mobile) +1 408.904.4762 (fax) === Globalization Engineering & Consulting Services
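A hedged sketch of the "strings only" interface being described (the names and calling convention are invented for illustration, not taken from the actual spec): the character argument of the strchr replacement arrives as a pointer to its first code unit inside some string, and the function compares scalar values internally.

    #include <stddef.h>
    #include <stdint.h>

    /* Decode the scalar value of the character that starts at p;
       well-formed, 0-terminated UTF-16 is assumed. */
    static uint32_t scalar_at(const uint16_t *p)
    {
        if (p[0] >= 0xD800 && p[0] <= 0xDBFF &&
            p[1] >= 0xDC00 && p[1] <= 0xDFFF)
            return 0x10000 + ((uint32_t)(p[0] - 0xD800) << 10) + (p[1] - 0xDC00);
        return p[0];
    }

    /* strchr replacement in the strings-only style: c is not a char, it
       is a pointer to the character (one or two code units) to look for. */
    static const uint16_t *ustrchr(const uint16_t *s, const uint16_t *c)
    {
        uint32_t target = scalar_at(c);
        for (; *s != 0; ++s) {
            if (scalar_at(s) == target)
                return s;
            if (s[0] >= 0xD800 && s[0] <= 0xDBFF &&
                s[1] >= 0xDC00 && s[1] <= 0xDFFF)
                ++s;                  /* this character used two code units */
        }
        return (target == 0) ? s : NULL;
    }

Nothing about UTF-16 shows through the signature, which is the "hiding" choice; the cost is that a caller who has only a bare code point must first put it into a small two-unit buffer.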
Re: Java and Unicode
Juliusz Chroboczek wrote: > I believe that Java strings use UTF-8 internally. .class files use a _modified_ utf-8. at runtime, strings are always in 16-bit unicode. > At any rate the > internal implementation is not exposed to applications -- note that > `length' is a method in class String (while it is a field in vector > classes). but length() and charAt() are some of the apis that expose that the internal representation is in 16-bit unicode, at least semantically. length() counts 16-bit units from ucs-2/utf-16, not bytes from utf-8 or code points from utf-32. all charAt() and substring() etc. behave like that. markus
Re: Java and Unicode
MS> In the case of Java, the equivalent course of action would be to MS> stick with a 16-bit char as the base type for strings. I believe that Java strings use UTF-8 internally. At any rate the internal implementation is not exposed to applications -- note that `length' is a method in class String (while it is a field in vector classes). Juliusz
Re: Java and Unicode
On Thu, 16 Nov 2000, Markus Scherer wrote: > The ICU API was changed this way within a few months this year. Some of the >higher-level implementations are still to follow until next summer, when there will >be some 45000 CJK characters that will be infrequent but hard to ignore - the Chinese >and Japanese governments will insist on their support. I hope support comes soon, too, as many CJK characters commonly used in writing the Yue (Cantonese) language wound up in CJK Extension B, such as U+28319 lip1 'elevator/lift'. Until then, one'll have to use a legacy Big5-HKSCS. Thomas Chan [EMAIL PROTECTED]
Re: Java and Unicode
Elliotte Rusty Harold wrote: > For example, consider the charAt() method in java.lang.String: > > public char charAt(int index) Just for comparison, ICU added a method to its UnicodeString class equivalent to this: public int char32At(int index) More difficult than the string class was the CharacterIterator: It had many more failings in common with its Java sibling than a lack of UTF-16 support, among them semantics for forward iteration that are inefficient and unusual and especially bad for a variable-width encoding. The ICU API was changed this way within a few months this year. Some of the higher-level implementations are still to follow until next summer, when there will be some 45000 CJK characters that will be infrequent but hard to ignore - the Chinese and Japanese governments will insist on their support. markus
Re: Java and Unicode
At 7:26 AM -0800 11/16/00, Valeriy E. Ushakov wrote: >On Thu, Nov 16, 2000 at 05:58:27 -0800, Elliotte Rusty Harold wrote: > >> public char charAt(int index) >> >> This method is used to walk strings, looking at each character in >> turn, a useful thing to do. Clearly it would be possible to replace >> it with a method with a String return type like this: >> >> public String characterAt(int index) > >And what method you will use to obtain the (single) character in the >returned string? :-) The point is you don't need this. You would always work with strings and never with chars. The API would be changed so that the char data type could be fully deprecated. Alternate approach: define a new Char class that could be used in place of char everywhere a character type was needed. -- +---++---+ | Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer | +---++---+ | The XML Bible (IDG Books, 1999) | | http://metalab.unc.edu/xml/books/bible/ | | http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/ | +--+-+ | Read Cafe au Lait for Java News: http://metalab.unc.edu/javafaq/ | | Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/ | +--+-+
Re: Java and Unicode
On Thu, Nov 16, 2000 at 05:58:27 -0800, Elliotte Rusty Harold wrote: > public char charAt(int index) > > This method is used to walk strings, looking at each character in > turn, a useful thing to do. Clearly it would be possible to replace > it with a method with a String return type like this: > > public String characterAt(int index) And what method you will use to obtain the (single) character in the returned string? :-) SY, Uwe -- [EMAIL PROTECTED] | Zu Grunde kommen http://www.ptc.spbu.ru/~uwe/| Ist zu Grunde gehen
Re: Java and Unicode
At 4:44 PM -0800 11/15/00, Markus Scherer wrote: >In the case of Java, the equivalent course of action would be to >stick with a 16-bit char as the base type for strings. The int type >could be used in _additional_ APIs for single Unicode code points, >deprecating the old APIs with char. > It's not quite that simple. Many of the key APIs in Java already use ints instead of chars where chars are expected. In particular, the Reader and Writer classes in java.io do this. I do agree that it makes sense to use strings rather than characters. I'm just wondering how bad the transition is going to be. Could we get away with eliminating (or at least deprecating) the char data type completely and all methods that use it? And can we do that without breaking all existing code and redesigning the language? For example, consider the charAt() method in java.lang.String: public char charAt(int index) This method is used to walk strings, looking at each character in turn, a useful thing to do. Clearly it would be possible to replace it with a method with a String return type like this: public String charAt(int index) The returned string would contain a single character (which might be composed of two surrogate chars). However, we can't simply add that method because Java can't overload on return type. So we have to give that method a new name like: public String characterAt(int index) OK. That one's not too bad, maybe even more intelligible than what we're replacing. But we have to do this in hundreds of places in the API! Some will be much worse than this. Is it really going to be possible to make this sort of change everywhere? Or is it time to bite the bullet and break backwards compatibility? Or should we simply admit that non-BMP characters aren't that important and stick with the current API? Or perhaps provide special classes that handle non-BMP characters as an ugly-bolt-on to the language that will be used by a few Unicode afficionados but ignored by most programmers, just like wchar is ignored in C to this day? None of these solutions are attractive. It may take the next post-Java language to really solve them. -- +---++---+ | Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer | +---++---+ | The XML Bible (IDG Books, 1999) | | http://metalab.unc.edu/xml/books/bible/ | | http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/ | +--+-+ | Read Cafe au Lait for Java News: http://metalab.unc.edu/javafaq/ | | Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/ | +--+-+
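The proposal is about Java, but the shape of the returned value is easy to show in the C used elsewhere in this thread. This is only an illustrative sketch (invented names; 0-terminated, well-formed UTF-16 assumed), not the Java API under discussion: the "character" comes back as a tiny string of one or two code units.

    #include <stddef.h>
    #include <stdint.h>

    /* Copy the single character starting at code-unit index i of a
       0-terminated UTF-16 string into out (room for two units plus a
       terminator).  Returns the number of code units copied: 1 for a
       BMP character, 2 for a surrogate pair, 0 if i is out of range. */
    static size_t character_at(const uint16_t *s, size_t i, uint16_t out[3])
    {
        size_t len = 0;
        while (s[len] != 0)
            ++len;
        if (i >= len) {
            out[0] = 0;
            return 0;
        }
        out[0] = s[i];
        out[1] = 0;
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            out[1] = s[i + 1];
            out[2] = 0;
            return 2;
        }
        return 1;
    }

Note that i is still a code-unit index in this sketch; whether such a method should count chars or characters is exactly the kind of question the thread leaves open.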
Re: Java and Unicode
Please let's keep types for single characters and types for strings separate. ICU used to be in the same situation as Java: everything character/string used 16-bit types. In extension to UTF-16, we decided to keep the string base type at 16 bits for very good reasons like interoperability and memory consumption. For single characters, ICU changed APIs from 16-bit to 32-bit types. In the case of Java, the equivalent course of action would be to stick with a 16-bit char as the base type for strings. The int type could be used in _additional_ APIs for single Unicode code points, deprecating the old APIs with char. Whatever Sun decides to do with single characters, it will be most reasonable to keep the string encoding the same and just treat it as UTF-16 where that makes a difference. For details, see my presentation at the IUC 17 Unicode conference (2000 September, session B2). (See http://www.unicode.org/ - I am having some trouble with web access right now, so I cannot give you the URL...) markus
Re: Java and Unicode
John O'Conner wrote: > Yes. If you have been involved with Unicode for any period of time at all, you > would know that the Unicode consortium has advertised Unicode's 16-bit > encoding for a long, long time, even in its latest Unicode 3.0 spec. The > Unicode 3.0 spec clearly favors the 16-bit encoding of Unicode code units, and > the design chapter (chapter 2) never even hints at a 32-bit encoding form. Indeed. Though, to be fair, people have been talking about UCS-4 and then UTF-32 for quite awhile now, and the UTF-32 Technical Report has been approved for half a year. FYI, on November 9, the Unicode Technical Committee officially voted to make Unicode Technical Report #19 "UTF-32" a Unicode Standard Annex (UAX). This will be effective with the rollout of the Unicode Standard, Version 3.1, and will make the 32-bit transformation format a coequal partner with UTF-16 and UTF-8 as sanctioned Unicode encoding forms. > > The previous 2.0 spec (and previous specs as well) promoted this 16-bit > encoding too...and even claimed that Unicode was a 16-bit, "fixed-width", > coded character set. There are lots of reasons why Java's char is a 16-bit > value...the fact that the Unicode Consortium itself has promoted and defined > Unicode as a 16-bit coded character set for so long is probably the biggest. It is easy to look back from the year 2000 and wonder why. But it is also important to remember the context of 1989-1991. During that time frame, the loudest complaints were from those who were proclaiming that Unicode's move from 8-bit to 16-bit characters would break all software, choke the databases, inflate all documents by a factor of two, and generally end the world as we knew it. As it turns out, they were wrong on all counts. But the rhetorical structure of the Unicode Standard was initially set up to be a hard sell for 16-bit characters *as opposed to* 8-bit characters. The implementation world has moved on. Now we have an encoding model for Unicode that embraces an 8-bit, a 16-bit, *and* a 32-bit encoding form, while acknowledging that the character encoding per se is effectively 21 bits. This is more complicated than we hoped for originally, of course, but I think most of us agree that the incremental complexity in encoding forms is a price we are willing to pay in order to have a single character encoding standard that can interoperate in 8-, 16-, and 32-bit environments. --Ken
Re: Java and Unicode
Jungshik Shin wrote: > That's exactly what I have in mind about Java. I can't help wondering why > Sun chose 2byte char instead of 4byte char when it was plainly obvious > that 2byte wouldn't be enough in the very near future. The same can be > said of Mozilla which internally uses BMP-only as far as I know. > Was it due to concerns over things like saving memory/storage, etc? Yes. If you have been involved with Unicode for any period of time at all, you would know that the Unicode consortium has advertised Unicode's 16-bit encoding for a long, long time, even in its latest Unicode 3.0 spec. The Unicode 3.0 spec clearly favors the 16-bit encoding of Unicode code units, and the design chapter (chapter 2) never even hints at a 32-bit encoding form. The Java char attempts to capture the basic encoding unit of this 16-bit, widely accepted encoding method. I'm sure the choice seemed plainly obvious at the time. The previous 2.0 spec (and previous specs as well) promoted this 16-bit encoding too...and even claimed that Unicode was a 16-bit, "fixed-width", coded character set. There are lots of reasons why Java's char is a 16-bit value...the fact that the Unicode Consortium itself has promoted and defined Unicode as a 16-bit coded character set for so long is probably the biggest. -- John O'Conner
Re: Java and Unicode
On Wednesday, November 15, 2000, at 12:08 PM, Roozbeh Pournader wrote: > > > On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote: > > > I do not think they are so theoretical, with both 10646 and Unicode > > including them in the very new future (unless you count it as > theoretical > > when you drop an egg but it has not yet hit the ground!). > > Lemme think. You're saying that when I have not even seen a single egg > hitting the ground, I should believe that it will hit some day? ;) > > Well, you should be expecting about 45,000 eggs within the next six months.
Re: Java and Unicode
On Wed, 15 Nov 2000, Thomas Chan wrote: > On Wed, 15 Nov 2000, Jungshik Shin wrote: > > > On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote: > > > > > > Many people try to compare this to DBCS, but it really is not the same > > > thing understanding lead bytes and trail bytes in DBCS is *astoundingly* > > > more complicated than handling surrogate pairs. > > > > Well, it depends on what multibyte encoding you're talking about. In case > > of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to > > SJIS(Windows94?), Windows-949(UHC), Windows-950, WIndows-125x(JOHAB), > > ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN , it's not that hard (about > > the same as UTF-16, I believe, especially in case of EUC-CN and EUC-KR) > > I would move EUC-JP and EUC-TW, and possibly EUC-KR (if you use more than > KS X 1001 in it) to the "complicated" group because of the shifting bytes > required to get to different planes/character sets. Well, EUC-KR has never used character sets other than US-ASCII(or its Korean variant KS X 1003) and KS X 1001 although a theoretical possibilty is there. More realistic (although very rarely used. there are only two known implementations :Hanterm - Korean xterm - and Mozilla ) complication for EUC-KR arises not from a third character set (KS X 1002) in EUC-KR but from 8byte-sequence representation of (11172-2350) Hangul syllables not covered by the repertoire of KS X 1001. As for EUC-JP(which uses JIS X 201/US-ASCII, JIS X 208 AND JIS X 0212) and EUC-TW, I know what you're saying. That's exactly why I added at the end of my prev. message 'especially in case of EUC-CN and EUC-KR' :-) Probably, I should have written among multibyte encodings at least EUC-CN and EUC-KR are as easy to handle as UTF-16. Jungshik Shin
Re: Java and Unicode
On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote: > I do not think they are so theoretical, with both 10646 and Unicode > including them in the very new future (unless you count it as theoretical > when you drop an egg but it has not yet hit the ground!). Lemme think. You're saying that when I have not even seen a single egg hitting the ground, I should believe that it will hit some day? ;)
Re: Java and Unicode
On Wed, 15 Nov 2000, Jungshik Shin wrote: > On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote: > > In any case, I think that UTF-16 is the answer here. > > > > Many people try to compare this to DBCS, but it really is not the same > > thing understanding lead bytes and trail bytes in DBCS is *astoundingly* > > more complicated than handling surrogate pairs. > > Well, it depends on what multibyte encoding you're talking about. In case > of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to > SJIS(Windows94?), Windows-949(UHC), Windows-950, WIndows-125x(JOHAB), > ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN , it's not that hard (about > the same as UTF-16, I believe, especially in case of EUC-CN and EUC-KR) I would move EUC-JP and EUC-TW, and possibly EUC-KR (if you use more than KS X 1001 in it) to the "complicated" group because of the shifting bytes required to get to different planes/character sets. Thomas Chan [EMAIL PROTECTED]
Re: Java and Unicode
On Wed, 15 Nov 2000, Doug Ewell wrote: > Elliotte Rusty Harold <[EMAIL PROTECTED]> wrote: > > > There are a number of possibilities that don't break backwards > > compatibility (making trans-BMP characters require two chars rather > > than one, defining a new wchar primitive data type that is 4-bytes > > long as well as the old 2-byte char type, etc.) but they all make the > > language a lot less clean and obvious. In fact, they all more or less > This is one of the great difficulties in creating a "clean" design: > making it flexible enough so that it remains clean even in the face of > unexpected changes (like Unicode requiring more than 16 bits). > > But was it really unexpected? I wonder when the Java specification was > written -- specifically, was it before or after Unicode and JTC1/SC2/WG2 > began talking openly about moving beyond 16 bits? That's exactly what I have in mind about Java. I can't help wondering why Sun chose 2byte char instead of 4byte char when it was plainly obvious that 2byte wouldn't be enough in the very near future. The same can be said of Mozilla which internally uses BMP-only as far as I know. Was it due to concerns over things like saving memory/storage, etc? Jungshik Shin
Re: Java and Unicode
On Wed, 15 Nov 2000, Michael (michka) Kaplan wrote: > In any case, I think that UTF-16 is the answer here. > > Many people try to compare this to DBCS, but it really is not the same > thing understanding lead bytes and trail bytes in DBCS is *astoundingly* > more complicated than handling surrogate pairs. Well, it depends on what multibyte encoding you're talking about. In case of 'pure' EUC encodings (EUC-JP, EUC-KR, EUC-CN, EUC-TW) as opposed to SJIS(Windows94?), Windows-949(UHC), Windows-950, WIndows-125x(JOHAB), ISO-2022-JP(-2), ISO-2022-KR, ISO-2022-CN , it's not that hard (about the same as UTF-16, I believe, especially in case of EUC-CN and EUC-KR) Jungshik Shin
RE: Java and Unicode
Elliotte Rusty Harold wrote:

> One thing I'm very curious about going forward: Right now character
> values greater than 65535 are purely theoretical. However this will
> change. It seems to me that handling these characters properly is
> going to require redefining the char data type from two bytes to
> four. This is a major incompatible change with existing Java.
> (...)

John O'Conner just wrote something about surrogates (http://www.unicode.org/unicode/faq/utf_bom.html#16) and UTF-16 (http://www.unicode.org/unicode/faq/utf_bom.html#5) in Java, but your message was probably already on its way:

> You can currently store UTF-16 in the String and StringBuffer classes.
> However, all operations are on char values or 16-bit code units. The
> upcoming release of the J2SE platform will include support for Unicode 3.0
> (maybe 3.0.1) properties, case mapping, collation, and character break
> iteration. There is no explicit support for surrogate pairs in Unicode at
> this time, although you can certainly find out if a code unit is a
> surrogate unit.
>
> In the future, as characters beyond 0xFFFF become more important, you can
> expect that more robust, official support will follow.
>
> -- John O'Conner

_ Marco
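As a concrete illustration of the last point in the quoted text, namely that you can already "find out if a code unit is a surrogate unit" by hand, here is a small Java sketch using only range arithmetic on char values; it assumes no library helpers beyond the plain String API. Treat it as an example rather than an official recipe; unpaired surrogates are simply passed through.

    // Print the code points of a String, combining surrogate pairs by hand.
    static void dumpCodePoints(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                char lo = s.charAt(i + 1);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    int cp = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                    System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
                    i++;                      // the pair used two code units
                    continue;
                }
            }
            System.out.println("U+" + Integer.toHexString(c).toUpperCase());
        }
    }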
Re: Java and Unicode
Elliotte Rusty Harold <[EMAIL PROTECTED]> wrote:

> There are a number of possibilities that don't break backwards
> compatibility (making trans-BMP characters require two chars rather
> than one, defining a new wchar primitive data type that is 4-bytes
> long as well as the old 2-byte char type, etc.) but they all make the
> language a lot less clean and obvious. In fact, they all more or less
> make Java feel like C and C++ feel when working with Unicode: like
> something new has been bolted on after the fact, and it doesn't
> really fit the old design.

This is one of the great difficulties in creating a "clean" design: making it flexible enough so that it remains clean even in the face of unexpected changes (like Unicode requiring more than 16 bits).

But was it really unexpected? I wonder when the Java specification was written -- specifically, was it before or after Unicode and JTC1/SC2/WG2 began talking openly about moving beyond 16 bits?

-Doug Ewell
Fullerton, California
Re: Java and Unicode
I do not think they are so theoretical, with both 10646 and Unicode including them in the very near future (unless you count it as theoretical when you drop an egg but it has not yet hit the ground!).

In any case, I think that UTF-16 is the answer here.

Many people try to compare this to DBCS, but it really is not the same thing; understanding lead bytes and trail bytes in DBCS is *astoundingly* more complicated than handling surrogate pairs.

michka

a new book on internationalization in VB at http://www.i18nWithVB.com/

- Original Message -
From: "Elliotte Rusty Harold" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 15, 2000 6:15 AM
Subject: Re: Java and Unicode

> One thing I'm very curious about going forward: Right now character
> values greater than 65535 are purely theoretical. However this will
> change. It seems to me that handling these characters properly is
> going to require redefining the char data type from two bytes to
> four. This is a major incompatible change with existing Java.
>
> There are a number of possibilities that don't break backwards
> compatibility (making trans-BMP characters require two chars rather
> than one, defining a new wchar primitive data type that is 4-bytes
> long as well as the old 2-byte char type, etc.) but they all make the
> language a lot less clean and obvious. In fact, they all more or less
> make Java feel like C and C++ feel when working with Unicode: like
> something new has been bolted on after the fact, and it doesn't
> really fit the old design.
>
> Are there any plans for handling this?
>
> --
> Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer
> The XML Bible (IDG Books, 1999)
> http://metalab.unc.edu/xml/books/bible/
> http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/
> Read Cafe au Lait for Java News: http://metalab.unc.edu/javafaq/
> Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/
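To make the comparison concrete, here is the whole of the "lead/trail" logic for UTF-16 in the other direction, packing a supplementary code point into its surrogate pair. This is only a sketch with an invented method name and no range checking, but the two lines of arithmetic are essentially the entire trick, which is roughly why surrogate pairs are so much simpler than DBCS lead/trail byte handling.

    // Encode a code point in 0x10000..0x10FFFF as a UTF-16 surrogate pair.
    static char[] toSurrogatePair(int codePoint) {
        int v = codePoint - 0x10000;                  // 20 significant bits remain
        char high = (char) (0xD800 + (v >> 10));      // top 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF));    // bottom 10 bits
        return new char[] { high, low };
    }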
Re: Java and Unicode
One thing I'm very curious about going forward: Right now character values greater than 65535 are purely theoretical. However this will change. It seems to me that handling these characters properly is going to require redefining the char data type from two bytes to four. This is a major incompatible change with existing Java.

There are a number of possibilities that don't break backwards compatibility (making trans-BMP characters require two chars rather than one, defining a new wchar primitive data type that is 4-bytes long as well as the old 2-byte char type, etc.) but they all make the language a lot less clean and obvious. In fact, they all more or less make Java feel like C and C++ feel when working with Unicode: like something new has been bolted on after the fact, and it doesn't really fit the old design.

Are there any plans for handling this?

--
Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer
The XML Bible (IDG Books, 1999)
http://metalab.unc.edu/xml/books/bible/
http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/
Read Cafe au Lait for Java News: http://metalab.unc.edu/javafaq/
Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/
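A short, hypothetical illustration of the "two chars rather than one" option described above: a Java String can already hold a supplementary character as a surrogate pair, but every char-based operation then sees two code units where the user sees one character. U+10400 is picked arbitrarily as an example code point beyond the BMP, and the class name is invented for the example.

    public class TwoChars {
        public static void main(String[] args) {
            String s = "\uD801\uDC00";          // the surrogate pair for U+10400
            System.out.println(s.length());     // prints 2: two code units, one character
        }
    }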
Java and Unicode
In a recent post I said something like this: "There is no explicit support for surrogate pairs in Unicode at this time."

I meant to say this: "There is no explicit support for surrogate pairs in Java at this time."

Sorry for the confusion,
John O'Conner
Re: Java and Unicode
You can currently store UTF-16 in the String and StringBuffer classes. However, all operations are on char values or 16-bit code units. The upcoming release of the J2SE platform will include support for Unicode 3.0 (maybe 3.0.1) properties, case mapping, collation, and character break iteration. There is no explicit support for surrogate pairs in Unicode at this time, although you can certainly find out if a code unit is a surrogate unit.

In the future, as characters beyond 0xFFFF become more important, you can expect that more robust, official support will follow.

-- John O'Conner

Jani Kajala wrote:

> As Unicode will soon contain characters defined beyond the code point range
> [0,65535], I'm wondering how Java is going to handle this.
>
> I didn't find any hints in the JDK documentation either; at least a few days
> ago, when I browsed the Java documentation about internationalization, I just
> saw a comment that 'Unicode is a 16-bit encoding.' (two errors in one
> sentence)
>
> Regards,
> Jani Kajala
Java and Unicode
As Unicode will soon contain characters defined beyond the code point range [0,65535], I'm wondering how Java is going to handle this.

I didn't find any hints in the JDK documentation either; at least a few days ago, when I browsed the Java documentation about internationalization, I just saw a comment that 'Unicode is a 16-bit encoding.' (two errors in one sentence)

Regards,
Jani Kajala
Java and unicode
I am learning Java and learning how to apply Unicode within Java programs. Perhaps readers might like to know of my experiences of using Java and Unicode together.

I learned Java from the free course at http://www.free-ed.net and got the Java Development Kit, a later version than the one originally used for the course, from the http://java.sun.com site. It is possible to download the software either as one file or as 23 files each less than 1.4 megabytes. The 23 separate files need joining together to produce the equivalent single file, as if one had downloaded it all in one file. However, I found the 23-files method much more convenient.

I was given help on Unicode by various people in this group.

In order to gain experience of using Java and Unicode together, I decided to write a program to carry out the decoding method that is described in chapter 8 of my electronic book The Eutotokens of Learning, which is at http://www.users.globalnet.co.uk/~ngo, our family webspace. The chapter is called "Software Unicorns". The decoding method starts about a quarter of the way into the chapter. The chapter is all in one web page; one may simply search for the first usage of the word unicorns, not counting the use in the title, and proceed from there if one wishes. There are also four diagrams detailing the coding format included. Readers are welcome to read the rest of the book should they so wish, but there is no need to do so if one simply wishes to look at the decoding method. The text relates to a method of being able to send Esperanto text using ASCII 7-bit printing characters and have it automatically decoded. There is also a software unicorn screensaver available, accessed from the home page index.

In programming the decoding method I wrote an applet such that the encoded text is keyed into a text area by the user at run time and the decoded text is then automatically drawn out using the drawString method. There is no need to bother about using the ZZZSTELOJ _ initialization part; simply proceed as if that has already been done before the applet started. The system involves the twelve Esperanto accented characters. The dialog fount that is used by default does not support these characters, but changing the font to Helvetica solves that problem.

I found that the practical programming of the method in Java using Unicode characters taught me a lot about both and about how they interact, so I thought that I would mention it here in case anyone might like to try it.

William Overington

24 July 2000
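For readers who want to try something similar, here is a minimal applet sketch along the lines described above: Esperanto's accented letters written as Unicode escapes and drawn with drawString after switching the font to Helvetica. It is not the decoder from the book (none of the encoding scheme is reproduced); the class name and the sample phrase are chosen purely for the example.

    import java.applet.Applet;
    import java.awt.Font;
    import java.awt.Graphics;

    public class EsperantoApplet extends Applet {
        public void paint(Graphics g) {
            // The default dialog font may lack the accented letters; Helvetica works.
            g.setFont(new Font("Helvetica", Font.PLAIN, 18));
            // "ehxosxangxo cxiujxauxde", a phrase containing all six accented letters
            g.drawString("e\u0125o\u015Dan\u011Do \u0109iu\u0135a\u016Dde", 20, 40);
        }
    }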