Re: [discuss] WCHAR_T <=> UTF-8 conversion

Alexander Pyhalov via illumos-discuss Fri, 15 Aug 2014 06:49:26 -0700

On 08/15/2014 12:50, Garrett D'Amore via illumos-discuss wrote:

Um.. Kind of.


First off, when using UTF-8 locales, the wchar_t's are indeed UTF-32/UCS-4.
(Note that the byte order in this case is native endian!  You can't use
wchar_t's over the wire; they aren't portable and are intended for
internal use only, whereas UTF-8 is always safe on the wire.)
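
To make the byte-order point concrete, here is a minimal sketch (not from the original thread) of what it takes to put a wide character on the wire without going through UTF-8: the byte order has to be fixed explicitly rather than copying the wchar_t's memory. The helper names put_utf32be and get_utf32be are hypothetical, and the sketch assumes the value really is a UCS-4 code point, which holds only in UTF-8 locales.

```c
#include <stdint.h>

/* Hypothetical helper: write one UCS-4 code point as 4 big-endian bytes.
 * Unlike a memcpy() of a wchar_t, the result is the same on every host. */
void put_utf32be(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(cp >> 24);
    out[1] = (unsigned char)(cp >> 16);
    out[2] = (unsigned char)(cp >> 8);
    out[3] = (unsigned char)cp;
}

/* Hypothetical helper: read the 4 big-endian bytes back into a code point. */
uint32_t get_utf32be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

Sending UTF-8 instead avoids the question entirely, since UTF-8 is defined as a byte sequence and has no byte order.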

Where it gets sticky is if you use a different locale.  For example, in
iso-8859-1 locales, the equivalent wchar_t is really just the 8-bit latin1
character, with the high order 3 bytes zero.  (On illumos wchar_t is
32 bits wide -- on Windows it's only 16.)

In a GB18030 locale (or other Asian locale), the code points are all
different.  You can't just convert those to wchar_t's.

The *correct* way to convert UTF-8 data to wchar_t's is with mbtowc (and
friends).  If you're *not* in a UTF-8 locale, then it's quite likely that
there is *no* valid conversion of the UTF-8 data to a wchar_t.

Programmatically, you can only convert a multibyte encoding to a wchar_t
(or vice versa) if the multibyte encoding matches the encoding scheme used
by the current locale.  That is, you can convert GB18030 multibyte strings
to wchar_t's only if you are in a locale like zh_CN.GB18030, and you can
only convert between UTF-8 and wchar_t's if your current locale is
something ending in .UTF-8.
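
The rule above can be sketched as follows: select a UTF-8 locale first, then let mbstowcs() do the conversion. This is only an illustration under stated assumptions. The helper names use_utf8_locale and utf8_to_wide are hypothetical, and the list of candidate locale names is a guess about what is installed; adjust it for your system.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: switch to the first UTF-8 locale that exists here.
 * The candidate names are assumptions; returns NULL if none is installed. */
const char *use_utf8_locale(void)
{
    static const char *candidates[] = {
        "C.UTF-8", "en_US.UTF-8", "ru_RU.UTF-8"
    };
    size_t i;

    for (i = 0; i < sizeof (candidates) / sizeof (candidates[0]); i++) {
        if (setlocale(LC_ALL, candidates[i]) != NULL)
            return (candidates[i]);
    }
    return (NULL);
}

/* Hypothetical helper: convert a multibyte string to wide characters using
 * the CURRENT locale's encoding. Returns the number of wide characters
 * stored (excluding the terminator), or (size_t)-1 if the input is not
 * valid in this locale. Output is truncated and terminated if too long. */
size_t utf8_to_wide(const char *utf8, wchar_t *out, size_t outlen)
{
    size_t n;

    if (outlen == 0)
        return ((size_t)-1);
    n = mbstowcs(out, utf8, outlen);
    if (n == (size_t)-1)
        return (n);
    if (n == outlen) {          /* no room was left for the terminator */
        n = outlen - 1;
        out[n] = L'\0';
    }
    return (n);
}
```

If use_utf8_locale() returns NULL, the conversion should not be attempted at all, since the bytes would then be interpreted in whatever encoding the current locale uses.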

Hopefully that helps?

I'm trying to understand how it should work, but I keep getting core dumps. Perhaps I just can't read man pages... For example, I'd like to convert the wchar_t * string L"Привет!" to a UTF-8 char *.


$ gcc -ggdb test_utf8_mbchar.c -o test_utf8_mbchar
$ ./test_utf8_mbchar
ret is 13
Привет!
Segmentation Fault (core dumped)
$ gdb ./test_utf8_mbchar core
GNU gdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i386-pc-solaris2.11".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /export/home/alp/srcs/tests/test_utf8_mbchar...done.

warning: core file may not match specified executable file.
[New LWP 1]
[New LWP 1]
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
Core was generated by `./test_utf8_mbchar'.
Program terminated with signal 11, Segmentation fault.
#0  0xfedd079b in _icv_iconv () from /usr/lib/iconv/UCS-4%UTF-8.so
(gdb) bt
#0  0xfedd079b in _icv_iconv () from /usr/lib/iconv/UCS-4%UTF-8.so
#1  0xfee7dc17 in iconv () from /lib/libc.so.1
#2  0x08050ff8 in main () at test_utf8_mbchar.c:31
(gdb)


--
Best regards,
Alexander Pyhalov,
system administrator of Computer Center of Southern Federal University

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>
#include <iconv.h>

int main()
{
  char out[1024], buf[1024];
  iconv_t cd;
  int ret;
  const wchar_t *in;
  size_t bufsz;

  size_t outsz=1024;

  setlocale(LC_ALL,"ru_RU.UTF-8");
  
  in = L"Привет!";
  cd = iconv_open("UTF-8","UCS-4");
  if (cd == (iconv_t)-1) {
    (void) fprintf(stderr, "iconv_open failed\n");
    return (1);
  }
  ret=wcsrtombs((char *) buf,&in,1024,NULL);
  printf("ret is %d\n",ret);
  buf[ret+1]='\0';
  wprintf(L"%s\n",buf);
  bufsz=ret+1;
  
  iconv(cd,(const char**)&buf,&bufsz,(char **)&out,&outsz);

  return 0;
}
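
A note on the program above, offered as a sketch rather than a definitive answer. In a UTF-8 locale wcsrtombs() already produces UTF-8 (the 13 bytes reported in the output), so the iconv() step is not needed. The crash at line 31 is consistent with passing &buf and &out, which are addresses of arrays, where iconv() expects the address of a char * variable; iconv() then treats the first bytes of the text as a pointer. The helper name wide_to_mb below is hypothetical, and the caller is assumed to have already selected a UTF-8 locale with setlocale().

```c
#include <locale.h>   /* setlocale(), called by the user of this helper */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to the current locale's
 * multibyte encoding (UTF-8 when a UTF-8 locale is active).
 * Returns the number of bytes stored, excluding the terminating NUL,
 * or (size_t)-1 if a character cannot be represented or the output
 * buffer is too small to hold the whole string plus terminator. */
size_t wide_to_mb(const wchar_t *in, char *out, size_t outlen)
{
    mbstate_t st;
    const wchar_t *src = in;
    size_t n;

    memset(&st, 0, sizeof (st));        /* start in the initial shift state */
    n = wcsrtombs(out, &src, outlen, &st);
    if (n == (size_t)-1)
        return (n);                     /* unrepresentable character */
    if (src != NULL)
        return ((size_t)-1);            /* stopped early: buffer too small */
    return (n);                         /* out is NUL-terminated here */
}
```

For example, once setlocale(LC_ALL, "ru_RU.UTF-8") has succeeded, wide_to_mb(L"Привет!", buf, sizeof (buf)) fills buf with UTF-8 that can be printed with printf("%s\n", buf); mixing printf() and wprintf() on the same stream is best avoided. If iconv() is still wanted for some other conversion, keep separate pointer variables (char *inp = buf; char *outp = out;) and pass &inp and &outp.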

