2009/7/28 Pedro Izecksohn:
>> #include <stdio.h>
>> #include <locale.h>
>> #include <stdlib.h>
>> #include <wchar.h>
>>
>> int main(void) {
>>   wchar_t wc;
>>   size_t ret;
>>   mbstate_t s = { 0 };
>>   puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
>>   printf("%i\n", mbrtowc(&wc, "\xe2", 1, 0));
>>   printf("%i\n", mbrtowc(&wc, "\x94", 1, 0));
>>   printf("%i\n", mbrtowc(&wc, "\x84", 1, 0));
>>   printf("%x\n", wc);
>>   return 0;
>> }
>>
>> The sequence E2 94 84 should translate to U+2504. Instead, the second
>> and third calls to mbrtowc report encoding errors. It does work
>> correctly if the three bytes are passed to mbrtowc() in one go:
>
> From the "Linux Programmer’s Manual" (release 3.15 of the Linux man-pages):
> "If the n bytes starting at s do not contain a complete multibyte
> character, mbrtowc() returns (size_t) -2."
Correct. And the first call to mbrtowc() does just that. The problem is
that the second call returns (size_t) -1, which signals an encoding error,
even though E2 94 is a valid yet incomplete sequence, i.e. it should also
return (size_t) -2 and remember what it has seen so far in its internal
state. The third call should then return 1 and write 0x2504 to wc.

Andy
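To make the expected byte-at-a-time behaviour concrete, here is a minimal, locale-independent sketch of a stateful UTF-8 decoder. It is only an illustration of the semantics described above, not Cygwin's or glibc's actual mbrtowc() implementation. The names utf8_state, utf8_feed, UTF8_INCOMPLETE and UTF8_INVALID are hypothetical, and the decoder is deliberately simplified: it does not reject overlong encodings, surrogates or out-of-range lead bytes, and a NUL byte returns 1 rather than 0.

```c
#include <stdint.h>

/* Hypothetical decoder state; NOT the real mbstate_t layout. */
typedef struct {
    uint32_t partial;   /* code point bits collected so far */
    int      remaining; /* continuation bytes still expected */
} utf8_state;

#define UTF8_INCOMPLETE (-2) /* plays the role of mbrtowc()'s (size_t) -2 */
#define UTF8_INVALID    (-1) /* plays the role of mbrtowc()'s (size_t) -1 */

/*
 * Feed a single byte to the decoder.
 * Returns 1 and stores the code point in *out when this byte completes a
 * character, UTF8_INCOMPLETE when the sequence is valid so far but more
 * bytes are needed, and UTF8_INVALID on a malformed sequence.
 */
int utf8_feed(utf8_state *st, unsigned char byte, uint32_t *out)
{
    if (st->remaining == 0) {
        if (byte < 0x80) {                 /* single-byte (ASCII) character */
            *out = byte;
            return 1;
        }
        if ((byte & 0xE0) == 0xC0) {       /* lead byte of a 2-byte sequence */
            st->partial = byte & 0x1F;
            st->remaining = 1;
        } else if ((byte & 0xF0) == 0xE0) { /* lead byte of a 3-byte sequence */
            st->partial = byte & 0x0F;
            st->remaining = 2;
        } else if ((byte & 0xF8) == 0xF0) { /* lead byte of a 4-byte sequence */
            st->partial = byte & 0x07;
            st->remaining = 3;
        } else {
            return UTF8_INVALID;           /* stray continuation or bad lead */
        }
        return UTF8_INCOMPLETE;            /* state remembers the lead byte */
    }

    if ((byte & 0xC0) != 0x80) {           /* expected a continuation byte */
        st->remaining = 0;
        return UTF8_INVALID;
    }

    /* Append the 6 payload bits of the continuation byte. */
    st->partial = (st->partial << 6) | (byte & 0x3F);
    if (--st->remaining > 0)
        return UTF8_INCOMPLETE;

    *out = st->partial;
    return 1;
}
```

Starting from a zeroed utf8_state and feeding E2, 94 and 84 one at a time gives UTF8_INCOMPLETE, UTF8_INCOMPLETE and then 1, with 0x2504 stored in *out. The arithmetic is (0x02 << 12) | (0x14 << 6) | 0x04 = 0x2504.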