Re: Unicode String literals on various platforms

2000-08-08 Thread Antoine Leca

Bob Jones wrote:
 
 In a C program, how do you code Unicode string literals on the following
 platforms:
 NT
 Unix (Sun, AIX, HP-UX)
 AS/400

We devised a solution for this problem in the C99 Standard.
The "solution" is named "UCN", for Universal Character Name, and
essentially consists of the \u notation borrowed from Java, as in
(with Ken's example)

  char C_thai[] = "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

And similarly wchar_t C_thai[] = L"\u0E40... or
 TCHAR C_thai[] = _T("\u0E40...
depending on your storage option. See below for more.

The benefit is that your C program is then portable to any platform
where the C compiler conforms to C99.
The drawback is that, as of today, there are very few such compilers.
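
For instance, here is a minimal C99 program exercising both forms (a
sketch; it assumes the compiler's execution character set can represent
Thai: GCC, for one, emits UTF-8 for the narrow literal):

  #include <stdio.h>
  #include <wchar.h>

  int main(void)
  {
      char    narrow[] = "\u0E02\u0E49\u0E32";   /* multibyte encoding  */
      wchar_t wide[]   = L"\u0E02\u0E49\u0E32";  /* wide-character form */

      /* the sizes differ per platform and execution character set */
      printf("narrow: %zu bytes\n", sizeof narrow - 1);
      printf("wide:   %zu code units\n", sizeof wide / sizeof wide[0] - 1);
      return 0;
  }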

 
 Everything I have read says not to use wchar_t for cross-platform apps
 because the size is not uniform, i.e. on NT it is an unsigned short (2 bytes)
 while on Unix it is an unsigned int (4 bytes).  If you create your own TCHAR
 or whatever, how do you handle string literals? 

A similar problem exists with numbers, doesn't it? And the usual solution
is to *not* exchange data in internal format, but rather to use textual
representations. Agreed?

For a C _program_, where the textual representations are string literals (rather
than arrays of integers), C99 UCNs are the way to go.

Now, since you are talking of wchar_t vs. other forms of storing characters,
I wonder whether you are not really asking about the problem of the manipulated
_data_, as opposed to the C program itself.

Then I believe the solution is exactly the same as with numbers: internally,
use whatever is most appropriate to the current platform (the TCHAR/_T()
solution of Microsoft is nice because it conveniently switches between
char and wchar_t depending on compilation options), but when exchanging data,
convert to a common, textual representation.
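
In outline, the tchar.h pattern looks like this (a sketch; the real
<tchar.h> defines many more wrappers than shown here):

  /* TCHAR is wchar_t when _UNICODE is defined at compile time,
     plain char otherwise */
  #ifdef _UNICODE
  typedef wchar_t TCHAR;
  #define _T(x) L##x
  #else
  typedef char TCHAR;
  #define _T(x) x
  #endif

  static TCHAR greeting[] = _T("foobar");  /* element size follows the build */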

Look at the %lc and %ls conversions of [w]printf/[w]scanf to learn how to
output/input wide characters to/from text files. Another solution is to use
"Unicode" files, with some dedicated conversions, pretty much the same as using
the htons(), ntohl(), etc. functions when dealing with low-level Internet
protocols.
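
For instance (a sketch: %ls converts the wide string to multibyte text
using the current locale, so the locale's codeset must be able to
represent the characters):

  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int main(void)
  {
      wchar_t name[] = L"\u0E40\u0E02";  /* wide-character data */

      setlocale(LC_ALL, "");  /* pick up the user's locale */
      printf("%ls\n", name);  /* wide -> multibyte on output */
      return 0;
  }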


I agree that the C Standard currently lacks a way to indicate that one
would like to open a text file using a specific encoding scheme (e.g. UTF-16LE/BE,
or UTF-8). And the discussions on this matter have been endless so far.
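
Until then, one can do the conversion by hand, in the spirit of htons();
for example (a sketch covering a single code unit only):

  #include <stdio.h>

  /* Write one UTF-16 code unit in little-endian order, byte by byte,
     so the file layout does not depend on the host's endianness. */
  static void put_utf16le(unsigned int u, FILE *fp)
  {
      putc(u & 0xFF, fp);         /* low byte first */
      putc((u >> 8) & 0xFF, fp);  /* then high byte */
  }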


 On NT L"foobar" gives each character 2 bytes,

Yes

 but on Unix L"foobar" uses 4 bytes per character.

Depends on the compiler. Some use 4 bytes, some 8 (on 64-bit boxes), and some
use only 8 bits (and are not Unicode compliant).
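
One can check easily enough:

  #include <stdio.h>
  #include <wchar.h>

  int main(void)
  {
      /* report the width of wchar_t on the current platform */
      printf("wchar_t is %u bytes here\n", (unsigned)sizeof(wchar_t));
      return 0;
  }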

 Even worse I suspect is the AS/400 where the string literal is probably in
 EBCDIC.

Perhaps (and even probably, as L'a' is required to be equal to 'a' in C),
but what is the problem? You are not going to memcpy() L"foobar", or
fwrite() it, are you? And I am sure your AS/400 implementation has
some way to specify on open() that a text file is really an "ASCII" file,
rather than an EBCDIC one. Or if it does not, it should...



Regards,
Antoine



Unicode String literals on various platforms

2000-08-03 Thread Jones, Bob

In a C program, how do you code Unicode string literals on the following
platforms:
NT
Unix (Sun, AIX, HP-UX)
AS/400

Everything I have read says not to use wchar_t for cross-platform apps
because the size is not uniform, i.e. on NT it is an unsigned short (2 bytes)
while on Unix it is an unsigned int (4 bytes).  If you create your own TCHAR
or whatever, how do you handle string literals?  On NT L"foobar" gives each
character 2 bytes, but on Unix L"foobar" uses 4 bytes per character.  Even
worse I suspect is the AS/400 where the string literal is probably in
EBCDIC.

Thanks,

Bob
[EMAIL PROTECTED]



Re: Unicode String literals on various platforms

2000-08-03 Thread Kenneth Whistler

Bob Jones asked:

 In a C program, how do you code Unicode string literals on the following
 platforms:
 NT
 Unix (Sun, AIX, HP-UX)
 AS/400


A somewhat cumbersome, but completely reliable, cross-platform way to
code occasional Unicode string literals in a C program is:

static unichar thai2[] = {
    0x0E40, 0x0E02, 0x0E17, 0x0E32, 0x0E49,
    0x0E1B, 0x0E07, 0x0E1C, 0x0E33, UNINULL
};


where "unichar" is typedef'ed to an unsigned short (i.e. an unsigned 16-bit
integer).

This form of initialization of Unicode strings in C works on all platforms,
including the EBCDIC ones.

Of course, if you have to deal with lots of Unicode string literals, then you
may be better off building some kind of resource-compiling system and then
referring to the string literals by a system of IDs, so that you don't
end up cluttering your code with lots of static array declarations that
may be difficult to maintain.
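
In its simplest form, the ID approach might look like this (a sketch;
the names are illustrative only, with thai2 as declared above):

  enum str_id { STR_THAI2, /* ... more IDs ... */ STR_COUNT };

  /* filled in by hand here, or generated by a resource compiler */
  static const unichar *const string_table[STR_COUNT] = {
      thai2,  /* STR_THAI2 */
  };

  static const unichar *get_string(enum str_id id)
  {
      return string_table[id];
  }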

--Ken



Re: Unicode String literals on various platforms

2000-08-03 Thread Jeu George



In a C program, how do you code Unicode string literals on the following
platforms:
NT
Unix (Sun, AIX, HP-UX)
AS/400

Could you explain this more specifically? Maybe give an example of where you
need this.

Everything I have read says not to use wchar_t for cross-platform apps
because the size is not uniform, i.e. on NT it is an unsigned short (2 bytes)
while on Unix it is an unsigned int (4 bytes).  If you create your own
TCHAR
or whatever, how do you handle string literals?  On NT L"foobar" gives each
character 2 bytes, but on Unix L"foobar" uses 4 bytes per character.

 Even
worse I suspect is the AS/400 where the string literal is probably in
EBCDIC.

EBCDIC is used on IBM mainframe and midrange systems (such as the AS/400).


Thanks,

Bob
[EMAIL PROTECTED]