Hi Ingo,
On Thu, Sep 21, 2023 at 03:04:24PM +0200, Ingo Schwarze wrote:
> In general, the tool for checking the validity of UTF-8 strings
> is a simple loop around mblen(3) if you want to report the precise
> positions of errors found, or simply mbstowcs(3) with a NULL pwcs
> argument if you are content with a one-bit "valid" or "invalid" answer.
According to mbstowcs(3):
------------------------------------------------------------------------
RETURN VALUES
     mbstowcs() returns:

     0 or positive
             The value returned is the number of elements stored in the array
             pointed to by pwcs, except for a terminating null wide character
             (if any).  If pwcs is not null and the value returned is equal
             to n, the wide-character string pointed to by pwcs is not null
             terminated.  If pwcs is a null pointer, the value returned is
             the number of elements to contain the whole string converted,
             except for a terminating null wide character.

     (size_t)-1  The array indirectly pointed to by s contains a byte
                 sequence forming invalid character.  In this case,
                 mbstowcs() sets errno to indicate the error.

ERRORS
     mbstowcs() may cause an error in the following cases:

     [EILSEQ]    s points to the string containing invalid or
                 incomplete multibyte character.
------------------------------------------------------------------------
To understand what mbstowcs(3) does, I wrote the little test.c program
pasted at the bottom. In the following examples, [a] stands for a UTF-8
a-acute (bytes 0xC3 0xA1) and (a) for an ISO Latin a-acute (byte 0xE1).
Using setlocale(LC_CTYPE, "en_US.UTF-8");
$ cc -g -Wall test.c
$ echo -n arbol | a.out
ulen: 5
$ echo -n [a]rbol | a.out
ulen: 5
$ echo -n (a)rbol | a.out
ulen: 5
Using setlocale(LC_CTYPE, "C");
$ cc -g -Wall test.c
$ echo -n arbol | a.out
ulen: 5
$ echo -n [a]rbol | a.out
ulen: 6
$ echo -n (a)rbol | a.out
ulen: 7
There is no error message in any case. I don't see how those return
values tell me that the third string is invalid UTF-8. Am I doing
something wrong?
test.c
========================================
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int
main(void)
{
	int c;
	size_t i, ulen;
	char s[100];

	/* Read stdin into s, leaving room for the terminating NUL. */
	i = 0;
	while (i < sizeof(s) - 1 && (c = getchar()) != EOF)
		s[i++] = c;
	s[i] = '\0';

	setlocale(LC_CTYPE, "en_US.UTF-8");
	//setlocale(LC_CTYPE, "C");

	/* With pwcs == NULL, mbstowcs(3) only counts; (size_t)-1 is an error. */
	if ((ulen = mbstowcs(NULL, s, 0)) == (size_t)-1)
		perror("error");
	printf("ulen: %zu\n", ulen);
	return 0;
}
--
Walter