Re: [discuss] WCHAR_T <=> UTF-8 conversion

Garrett D'Amore via illumos-discuss Fri, 15 Aug 2014 07:45:13 -0700

I don't know why the iconv module is dumping core, but recognize that you
can't just initialize a wchar_t string from a constant such as "Hello".


The wide characters have to be initialized properly; it's not generally
possible to do this from constant values directly.  Instead you have to
convert to the wide characters first, from another format.  I recommend, if
you are running in a UTF-8 locale (such as ru_RU.UTF-8), that you use
mbstowcs() to convert a UTF-8 string to wchar_t's.  You should then be able
to convert from UCS-4 to UTF-8.  (Note carefully though, the fact that the
wchar_t's are UCS-4 is *not an interface*.  The encoding of wchar_t's is a
platform implementation detail.)

Here's an example:

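  ...
  wchar_t wcs[32];
  char utf8[32];
  char *inp, *outp;
  size_t inlen, outlen;
  iconv_t hdl;

  setlocale(LC_ALL, "ru_RU.UTF-8");

  // pass the array itself (not &wcs); the count is in wide characters
  mbstowcs(wcs, "спасибо болшой", sizeof (wcs) / sizeof (wcs[0]));
  // wcs now contains UCS-4 version of Russian thank you very much
  ...
  inp = (char *)wcs;
  outp = utf8;
  inlen = wcslen(wcs) * sizeof (wchar_t);
  outlen = sizeof (utf8) - 1;	// leave room for the terminating NUL

  hdl = iconv_open("UTF-8", "UCS-4");
  // iconv() takes pointers to the buffer pointers and advances them
  iconv(hdl, &inp, &inlen, &outp, &outlen);
  *outp = '\0';			// iconv() does not NUL-terminate its output
  iconv_close(hdl);
  // utf8 now contains "спасибо болшой"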

Note that the above is most definitely *not* the recommended way to get to
UCS-4.  The only formally correct way to get to UCS-4 from UTF-8 is to use
iconv() to convert from UTF-8.  The only APIs that you should formally be
sending wchar_t's to are the wide character routines (e.g. wcslen()).
Passing wchar_t's directly to iconv() as I've done above is technically
incorrect, although I believe in the case above it will work.
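
As a rough sketch of the formally correct route mentioned above (UTF-8 to
UCS-4 through iconv(), with no wchar_t involved), something like the
following might do.  The helper name utf8_to_ucs4() is hypothetical, and the
sketch assumes the platform's iconv accepts the name "UCS-4BE"; an explicit
byte order is requested so that the result does not depend on the host's
endianness.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Convert a NUL-terminated UTF-8 string into UCS-4 code points.
 * Returns the number of code points stored in dst, or (size_t)-1 if the
 * converter is unavailable, the input is invalid, or dst is too small.
 * utf8_to_ucs4() is an illustrative helper, not a library function.
 */
size_t
utf8_to_ucs4(const char *src, uint32_t *dst, size_t dstcap)
{
	iconv_t cd;
	char *inp, *outp;
	size_t inleft, outleft, rv, n, i;

	assert(src != NULL && dst != NULL);

	/* iconv() wants char ** even for input it does not modify. */
	inp = (char *)src;
	outp = (char *)dst;
	inleft = strlen(src);
	outleft = dstcap * sizeof (dst[0]);

	/* Assumption: "UCS-4BE" is a converter name known to this iconv. */
	cd = iconv_open("UCS-4BE", "UTF-8");
	if (cd == (iconv_t)-1)
		return ((size_t)-1);

	rv = iconv(cd, &inp, &inleft, &outp, &outleft);
	(void) iconv_close(cd);
	if (rv == (size_t)-1)
		return ((size_t)-1);

	n = dstcap - outleft / sizeof (dst[0]);

	/* Turn each big-endian 4-byte group into a host-order integer. */
	for (i = 0; i < n; i++) {
		const unsigned char *b = (const unsigned char *)&dst[i];

		dst[i] = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
		    ((uint32_t)b[2] << 8) | (uint32_t)b[3];
	}
	return (n);
}
```

Because the conversion is named explicitly on both sides, this does not
depend on the current locale or on how the platform encodes wchar_t.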

Note that the mbstowcs() conversion in the example above will not work in
the "C" locale; it depends on the current locale being a UTF-8 one.
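
To illustrate the locale dependence, here is a minimal sketch that checks for
a UTF-8 locale first and then goes from UTF-8 to wchar_t's and back using only
the multibyte/wide character routines.  The helper names select_utf8_locale()
and roundtrip_via_wchar() are made up for this example, and the list of locale
names tried is only a guess at what might be installed on a given system.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Try a few UTF-8 locale names (an assumed list, adjust as needed).
 * Returns 1 if the active locale's codeset is UTF-8 afterwards, else 0.
 */
int
select_utf8_locale(void)
{
	static const char *names[] = {
		"ru_RU.UTF-8", "en_US.UTF-8", "C.UTF-8"
	};
	size_t i;

	for (i = 0; i < sizeof (names) / sizeof (names[0]); i++) {
		if (setlocale(LC_ALL, names[i]) != NULL &&
		    strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
			return (1);
	}
	return (0);
}

/*
 * Convert a multibyte string to wchar_t's and back again in the current
 * locale.  Returns the number of bytes stored in out (not counting the
 * NUL), or (size_t)-1 on invalid input or insufficient space.
 */
size_t
roundtrip_via_wchar(const char *in, char *out, size_t outcap)
{
	wchar_t wcs[64];
	const size_t wcap = sizeof (wcs) / sizeof (wcs[0]);
	size_t nwc, nb;

	assert(in != NULL && out != NULL);

	nwc = mbstowcs(wcs, in, wcap);
	if (nwc == (size_t)-1 || nwc == wcap)
		return ((size_t)-1);	/* bad sequence, or no room for L'\0' */

	nb = wcstombs(out, wcs, outcap);
	if (nb == (size_t)-1 || nb == outcap)
		return ((size_t)-1);	/* unrepresentable, or no room for NUL */

	return (nb);
}
```

A caller would run select_utf8_locale() once at startup and only attempt the
conversion if it returns 1; what happens to non-ASCII bytes in the "C" locale
differs between platforms, so the sketch makes no promise there.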


On Fri, Aug 15, 2014 at 6:48 AM, Alexander Pyhalov wrote:

> On 08/15/2014 12:50, Garrett D'Amore via illumos-discuss wrote:
>
>> Um.. Kind of.
>>
>> First off, when using UTF-8 locales, the wchar_t's are indeed
>> UTF-32/UCS-4.  (Note that the byte order in this case is native endian!
>> You can't use wchar_t's over the wire; they aren't portable and are
>> intended for internal use only, whereas UTF-8 is always safe on the
>> wire.)
>>
>> Where it gets sticky is if you use a different locale.  For example, in
>> iso-8859-1 locales, the equivalent wchar_t is really just the 8-bit latin1
>> character, with the high order 3 bytes zero.  (On illumos wchar_t is
>> 32 bits wide -- on Windows it's only 16.)
>>
>> In a GB18030 locale (or other Asian locale), the code points are all
>> different.  You can't just convert those to wchar_t's.
>>
>> The *correct* way to convert UTF-8 data to wchar_t's is with mbtowc (and
>> friends).  If you're *not* in a UTF-8 locale, then it's quite likely that
>> there is *no* valid conversion of the UTF-8 data to a wchar_t.
>>
>> Programmatically, you can only convert a multibyte encoding to a wchar_t
>> (or vice versa) if the multibyte encoding matches the encoding scheme used
>> by the current locale.  That is, you can convert GB18030 multibyte strings
>> to wchar_t's only if you are in a locale like zh_CN.GB18030, and you can
>> only convert between UTF-8 and wchar_t's if your current locale is
>> something ending in .UTF-8.
>>
>> Hopefully that helps?
>>
>>
> I'm trying to understand how it should work, but receive permanent core
> dumps. Perhaps, I just can't read man pages... For example, I'd like to
> convert wchar_t* L"Привет!" to UTF-8 char *.
>
> $ gcc -ggdb test_utf8_mbchar.c -o test_utf8_mbchar
> $ ./test_utf8_mbchar
> ret is 13
> Привет!
> Segmentation Fault (core dumped)
> $ gdb ./test_utf8_mbchar core
> GNU gdb (GDB) 7.6.2
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "i386-pc-solaris2.11".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /export/home/alp/srcs/tests/test_utf8_mbchar...done.
>
> warning: core file may not match specified executable file.
> [New LWP 1]
> [New LWP 1]
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> Core was generated by `./test_utf8_mbchar'.
> Program terminated with signal 11, Segmentation fault.
> #0  0xfedd079b in _icv_iconv () from /usr/lib/iconv/UCS-4%UTF-8.so
> (gdb) bt
> #0  0xfedd079b in _icv_iconv () from /usr/lib/iconv/UCS-4%UTF-8.so
> #1  0xfee7dc17 in iconv () from /lib/libc.so.1
> #2  0x08050ff8 in main () at test_utf8_mbchar.c:31
> (gdb)
>
>
>
> --
> Best regards,
> Alexander Pyhalov,
> system administrator of Computer Center of Southern Federal University
>


