RE: Unicode String literals on various

2000-08-13 Thread Edward Cherlin

At 9:58 AM -0800 8/8/00, [EMAIL PROTECTED] wrote:
Hi, Antoine.

  I can continue to dissert on this subject

Please!

(all of this should finally be cooked in a FAQ anyway),

I'll help, which means I need as much of your dissertings as possible.

but I do not want to flood the list
  with a marginally interesting subject.

Merci beaucoup. It was very informative!

Ciao.
   Marco

   P.S. You should not be so shy: up-to-date information
   about how Unicode may be used in the world's most
   important programming language does not sound so
   "off topic" or "marginally interesting" to me.

Second the motion. All in favor, please say "Aye" to Marco off-list.

   Ciao++
   M.

-- 

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland



Re: Unicode String literals on various platforms

2000-08-08 Thread Antoine Leca

Bob Jones wrote:
 
 In a C program, how do you code Unicode string literals on the following
 platforms:
 NT
 Unix (Sun, AIX, HP-UX)
 AS/400

We devised a solution for this problem in the C99 Standard.
The "solution" is named "UCN", for Universal Character Name, and
is essentially to use the \u notation (borrowed from Java), like
(with Ken's example)

  char C_thai[] = "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

and similarly

  wchar_t C_thai[] = L"\u0E40...
  TCHAR_T C_thai[] = T("\u0E40...

depending on your storage option. See below for more.

The benefit is that your C program is then portable to any platform
where the C compiler complies with C99.
The drawback is that, nowadays, there are very few such compilers.

 
 Everything I have read says not to use wchar_t for cross platform apps
 because the size is not uniform, i.e. NT it is an unsigned short (2 bytes)
 while on Unix it is an unsigned int (4 bytes).  If you create your own TCHAR
 or whatever, how do you handle string literals? 

A similar problem exists with numbers, doesn't it? And the usual solution
is to *not* exchange data in internal format, but rather to use textual
representations. Agreed?

For a C _program_, where the textual representations are string literals
(rather than arrays of integers), C99 UCNs are the way to go.

Now, since you are talking about wchar_t vs. other forms of storing characters,
I wonder if you are not asking about the problem of the manipulated _data_,
as opposed to the C program.

Then, I believe the solution is exactly the same as with numbers: internally,
use whatever is most appropriate on the current platform (the TCHAR_T/T()
solution of Microsoft is nice because it conveniently switches between
char and wchar_t depending on compilation options), but when exchanging data,
change to a common, textual representation.

Look at the %lc and %ls conversions of [w]printf/[w]scanf to learn how to
output/input wide characters to/from text files. Another solution is to use
"Unicode" files, with some dedicated conversions, pretty much the same as
using the htons(), ntohl(), etc. functions when dealing with low-level
Internet protocols.


I agree the C Standard currently lacks a way to indicate that one would
open a text file using a specific encoding protocol (e.g. UTF-16LE/BE,
or UTF-8). And the discussions on this matter have been endless so far.


 On NT L"foobar" gives each character 2 bytes,

Yes

 but on Unix L"foobar" uses 4 bytes per character.

Depends on the compiler. Some are 4 bytes, some are 8 (64-bit boxes), and some
are even only 8 bits (and are not Unicode compliant).

 Even worse I suspect is the AS/400 where the string literal is probably in
 EBCDIC.

Perhaps (and even probably, as L'a' is required to be equal to 'a' in C),
but what is the problem? You are not going to memcpy() L"foobar", or
to fwrite() it, are you? And I am sure your AS/400 implementation has
some way to specify on open() that a text file is really an "ASCII" file,
rather than an EBCDIC one. Or if it does not, it should...



Regards,
Antoine



RE: Unicode String literals on various

2000-08-08 Thread Marco . Cimarosti

Antoine Leca wrote:
   char C_thai[] = 
 "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

Would the Unicode values be converted to the local SBCS/MBCS character set?

If yes:

Is the definition of this locale info part of the C99 standard itself, or is
it the operating system's locale?

And what happens to Unicode values that cannot be converted into that
character set?

Thanks.
_ Marco



Re: Unicode String literals on various

2000-08-08 Thread Antoine Leca

[EMAIL PROTECTED] wrote:
 
 Antoine Leca wrote:
char C_thai[] =
  "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
 
 Would the Unicode values be converted to the local SBCS/MBCS character set?

In this case, yes (assuming a normal C compiler).

With wchar_t / L"...", they are converted to the local "wide character set",
which happens to be Unicode on most boxes, with the following main exceptions:

- some (cheap) C compilers do not have any special support for wchar_t,
 so it defaults to the same as char, and is usually 8 bits;

- with East Asian C compilers, wchar_t is either Unicode or
 a flat character encoding, that is, every character, whether coded as SBCS
 or DBCS, is stored, with its nominal legacy code, in a 16-bit or 32-bit cell
 (this differs from MBCS in that the ASCII characters are stored
 in cells the same width as DBCS characters);

- EBCDIC implementations have their own rules (for obvious reasons), which
 I do not know exactly (I am not sure they are consistent).

C99 also specifies that if __STDC_ISO_10646__ is defined, then the wchar_t
values are the Unicode code points (then, to learn whether that means UTF-16
or UTF-32, one should look at WCHAR_MAX to see if wchar_t is 16-bit or 32-bit).


 
 If yes:
 
 Is the definition of this locale info part of the C99 standard itself, or is
 it operating system's locale?

It is "implementation-defined". Which means:
- it is not required in any way by the C99 Standard itself (except if
 __STDC_ISO_10646__ is defined);
- it is required to be stated in full words in the documentation for the compiler;
- it can vary with compilation options; often the OS's current locale is
 the default value, which can be overridden.

 
 And what happens to Unicode values that cannot be converted in that
 character set?

The compiler is required to fall back to something (it cannot refuse to
compile, nor can it simply drop the character); it is allowed to "fall back"
to a different character depending on the source character, though; so, for
example,

  #include <stdio.h>
  int main() {  printf("%ls\n", L"\u00C0 table!");  return 0;  }

can produce (among others; the examples below are UTF-8 encoded):

À table!
A table!
à table!
 table!



I can continue to dissert on this subject (all of this should finally be
cooked in a FAQ anyway), but I do not want to flood the list with a
marginally interesting subject.


Antoine



RE: Unicode String literals on various

2000-08-08 Thread Marco . Cimarosti

Hi, Antoine.

 I can continue to dissert on this subject (all of this should
 finally be cooked in a FAQ anyway), but I do not want to flood the list
 with a marginally interesting subject.

Merci beaucoup. It was very informative!

Ciao.
Marco

P.S. You should not be so shy: up-to-date information
about how Unicode may be used in the world's most
important programming language does not sound so
"off topic" or "marginally interesting" to me.

Ciao++
M.



Re: Unicode String literals on various platforms

2000-08-03 Thread Kenneth Whistler

Bob Jones asked:

 In a C program, how do you code Unicode string literals on the following
 platforms:
 NT
 Unix (Sun, AIX, HP-UX)
 AS/400


A somewhat cumbersome, but completely reliable cross-platform way to
code occasional Unicode string literals in a C program is:

static unichar thai2[] = {0x0E40,0x0E02,0x0E17,0x0E32,0x0E49,0x0E1B,0x0E07,
0x0E1C,0x0E33,UNINULL};


where "unichar" is typedef'ed to an unsigned short (i.e. an unsigned 16-bit
integer).

This form of initialization of Unicode strings in C works on all platforms,
including the EBCDIC ones.

Of course, if you have to deal with lots of Unicode string literals, then you
may be better off coding some kind of a resource compiling system, and then
referring to the string literals by a system of id's, so that you don't
end up cluttering your code with lots of static array declarations that
may be difficult to maintain.

--Ken



Re: Unicode String literals on various platforms

2000-08-03 Thread Jeu George



In a C program, how do you code Unicode string literals on the following
platforms:
NT
Unix (Sun, AIX, HP-UX)
AS/400

Could you explain this more specifically? Maybe give an example where you
need this.

Everything I have read says not to use wchar_t for cross platform apps
because the size is not uniform, i.e. NT it is an unsigned short (2 bytes)
while on Unix it is an unsigned int (4 bytes).  If you create your own
TCHAR
or whatever, how do you handle string literals?  On NT L"foobar" gives each
character 2 bytes, but on Unix L"foobar" uses 4 bytes per character.

 Even
worse I suspect is the AS/400 where the string literal is probably in
EBCDIC.

EBCDIC is used on IBM systems.


Thanks,

Bob
[EMAIL PROTECTED]