Re: [discuss] WCHAR_T <=> UTF-8 conversion

Garrett D'Amore via illumos-discuss Fri, 15 Aug 2014 08:21:46 -0700

To get from wchar_t to a multibyte string, you use wcstombs().  Note that the
resulting output will only be UTF-8 if the locale is *.UTF-8.  (If you're
in a different locale, the multibyte string may well be in a different
encoding.)
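
As a rough sketch of the wcstombs() route: the helper name wide_to_mb is
made up for illustration, and the result is UTF-8 only when the current
LC_CTYPE locale is a UTF-8 one.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert a wide string to a newly allocated multibyte string in the
 * encoding of the current LC_CTYPE locale.  Returns NULL on failure
 * (for example, a character that the locale cannot represent).
 * wide_to_mb is a hypothetical helper name, not a library function.
 */
char *
wide_to_mb(const wchar_t *ws)
{
	/* Worst case: every wide character needs MB_CUR_MAX bytes. */
	size_t cap = wcslen(ws) * MB_CUR_MAX + 1;
	char *buf = malloc(cap);
	size_t n;

	if (buf == NULL)
		return (NULL);

	n = wcstombs(buf, ws, cap);
	if (n == (size_t)-1) {
		free(buf);
		return (NULL);
	}
	buf[n] = '\0';
	return (buf);
}
```

The caller would select the locale first, e.g. with setlocale(LC_ALL, "")
from <locale.h>, and free() the returned string when done.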


Visually, your code above looks OK, so I'm not sure what's wrong.  Is it a
bug in libiconv?  My guess is yes, since even if the encoding were invalid,
it shouldn't just dump core.  Instead it should return an error such as
EILSEQ.
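
For reference, here is a minimal sketch of an iconv() call with error
checking, so that an invalid input sequence is reported as EILSEQ.  The
helper name convert_buf is hypothetical, and which encoding names
iconv_open() accepts depends on the platform.  One detail worth checking in
any caller: iconv() takes the addresses of pointer variables and advances
them, so the address of an array (&buf) is not a valid argument.  Some
platforms declare the input argument as const char **, in which case the
cast below would need adjusting.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes at in from encoding `from` to encoding `to`,
 * writing at most outcap bytes to out.  Returns the number of bytes
 * written, or (size_t)-1 with errno set (EILSEQ for an invalid input
 * sequence, EINVAL for a truncated one, E2BIG if out is too small).
 * convert_buf is a hypothetical helper, not part of any library.
 */
size_t
convert_buf(const char *to, const char *from,
    const char *in, size_t inlen, char *out, size_t outcap)
{
	iconv_t cd;
	char *inp = (char *)in;	/* iconv() advances these copies */
	char *outp = out;
	size_t outleft = outcap;
	size_t rv;
	int saved;

	cd = iconv_open(to, from);
	if (cd == (iconv_t)-1)
		return ((size_t)-1);

	rv = iconv(cd, &inp, &inlen, &outp, &outleft);
	saved = errno;		/* iconv_close() may change errno */
	(void) iconv_close(cd);

	if (rv == (size_t)-1) {
		errno = saved;
		return ((size_t)-1);
	}
	return (outcap - outleft);
}
```

The output is not NUL-terminated; a caller that wants a C string has to
reserve one byte and add the terminator itself.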

Admittedly, I have less than perfect confidence in our libiconv
implementation.




On Fri, Aug 15, 2014 at 8:05 AM, Alexander Pyhalov <[email protected]> wrote:

> On 08/15/2014 18:44, Garrett D'Amore wrote:
>
>> I don't know why the iconv module is dumping core, but recognize you can't
>> just write wchar_t "Hello".
>>
>> The wide characters have to be initialized properly; it's not generally
>> possible to do this from constant values directly.  Instead you have to
>> convert to the wide characters first, from another format.  I recommend,
>> if you are running in a UTF-8 locale (such as ru_RU.UTF-8), that you use
>> mbstowcs() to convert a UTF-8 string to wchar_t's.  You should then be
>> able to convert from UCS-4 to UTF-8.  (Note carefully, though, that the
>> fact that the wchar_t's are UCS-4 is *not an interface*.  The encoding of
>> wchar_t's is a platform implementation detail.)
>>
>> Here's an example:
>>
>>    ...
>>    wchar_t wcs[32];
>>    char utf8[32];
>>    char *inp = (char *)wcs, *outp = utf8;
>>    size_t inlen, outlen;
>>    iconv_t hdl;
>>
>>    setlocale(LC_ALL, "ru_RU.UTF-8");
>>
>>    mbstowcs(wcs, "спасибо большое", sizeof (wcs) / sizeof (wcs[0]));
>>    // wcs now contains the UCS-4 version of Russian "thank you very much"
>>    ...
>>    inlen = wcslen(wcs) * sizeof (wchar_t);
>>    outlen = sizeof (utf8);
>>
>>    hdl = iconv_open("UTF-8", "UCS-4");
>>    // iconv() takes char ** and advances the pointers, so pass copies
>>    iconv(hdl, &inp, &inlen, &outp, &outlen);
>>    // utf8 now contains "спасибо большое" (not NUL-terminated)
>>
>>
> Let's try...
>
>   char out[1024];
>   iconv_t cd;
>   int ret;
>   wchar_t in[1024];
>   size_t inlen;
>
>   size_t outsz=sizeof(out);
>
>   setlocale(LC_ALL,"ru_RU.UTF-8");
>
>   mbstowcs(in,"Привет!",sizeof (in) / sizeof (in[0]));
>   inlen=wcslen(in) * sizeof (wchar_t);
>   cd = iconv_open("UTF-8","UCS-4");
>            if (cd == (iconv_t)-1) {
>                (void) fprintf(stderr, "iconv_open failed\n");
>                return (1);
>            }
>   iconv(cd,&in,&inlen,&out,&outsz);
>
> $ ./test_utf8_mbchar
> Segmentation Fault (core dumped)
>
>
>
>
>> Note that the above is most definitely *not* the recommended way to get to
>> UCS-4.  The only formally correct way to get to UCS-4 from UTF-8 is to use
>> iconv() to convert from UTF-8.  The only APIs that you should formally be
>> sending wchar_t's to are the wide character routines (e.g. wcslen()).
>> Passing wchar_t's directly to iconv as I've done above is technically
>> incorrect, although I believe in the case above it will work.
>>
>> Note that this will not work in the "C" locale.
>>
>
> And what is the recommended way of converting wchar_t * to UTF-8 char *?
>
>
> --
> Best regards,
> Alexander Pyhalov,
> system administrator of Computer Center of Southern Federal University
>


