Hi Ingo,

On Thu, Sep 21, 2023 at 03:04:24PM +0200, Ingo Schwarze wrote:
> In general, the tool for checking the validity of UTF-8 strings
> is a simple loop around mblen(3) if you want to report the precise
> positions of errors found, or simply mbstowcs(3) with a NULL pwcs
> argument if you are content with a one-bit "valid" or "invalid" answer.

According to mbstowcs(3):
------------------------------------------------------------------------
RETURN VALUES
  mbstowcs() returns:

  0 or positive
        The value returned is the number of elements stored in the array
        pointed to by pwcs, except for a terminating null wide character
        (if any).  If pwcs is not null and the value returned is equal
        to n, the wide-character string pointed to by pwcs is not null
        terminated.  If pwcs is a null pointer, the value returned is
        the number of elements to contain the whole string converted,
        except for a terminating null wide character.

  (size_t)-1  The array indirectly pointed to by s contains a byte
              sequence forming invalid character.  In this case,
              mbstowcs() sets errno to indicate the error.

ERRORS
     mbstowcs() may cause an error in the following cases:

     [EILSEQ]  s points to the string containing invalid or
               incomplete multibyte character.
------------------------------------------------------------------------

To understand what mbstowcs(3) does, I wrote the small test.c program
pasted at the bottom.  In the following examples, [a] stands for a UTF-8
encoded aacute and (a) for an ISO Latin encoded aacute.

Using setlocale(LC_CTYPE, "en_US.UTF-8");

  $ cc -g -Wall test.c
  $ echo -n arbol | a.out
  ulen: 5
  $ echo -n [a]rbol | a.out
  ulen: 5
  $ echo -n (a)rbol | a.out
  ulen: 5

Using setlocale(LC_CTYPE, "C");

  $ cc -g -Wall test.c
  $ echo -n arbol | a.out
  ulen: 5
  $ echo -n [a]rbol | a.out
  ulen: 6
  $ echo -n (a)rbol | a.out
  ulen: 7

There is no error message in any case.  I don't see how those return
values tell me that the third string is invalid UTF-8.  Am I doing
something wrong?


test.c
========================================
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int
main(void)
{

        int c;
        size_t i, ulen;
        char s[100];

        /* read stdin, leaving room for the terminating NUL */
        i = 0;
        while (i < sizeof(s) - 1 && (c = getchar()) != EOF)
                s[i++] = c;

        s[i] = '\0';

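        /* setlocale(3) returns NULL if the locale cannot be selected */
        if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
                fprintf(stderr, "setlocale failed\n");
        //setlocale(LC_CTYPE, "C");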

        if ((ulen = mbstowcs(NULL, s, 0)) == (size_t)-1)
                perror("error");

        printf("ulen: %zu\n", ulen);

        return 0;
}

-- 
Walter
