Re: Unicode grapheme clusters

Bruce Momjian Sat, 21 Jan 2023 10:13:18 -0800

On Sat, Jan 21, 2023 at 12:37:30PM -0500, Bruce Momjian wrote:
> Well, as one of the URLs I quoted said:
> 
>       This is by design. wcwidth() is utterly broken. Any terminal or
>       terminal application that uses it is also utterly broken. Forget
>       about emoji wcwidth() doesn't even work with combining characters,
>       zero width joiners, flags, and a whole bunch of other things.
> 
> So, either we have to find a function in the library that will do the
> looping over the string for us, or we need to identify the special
> Unicode characters that create grapheme clusters and handle them in our
> code.


I just checked whether wcswidth() honors grapheme clusters, even though
wcwidth() does not, but it seems wcswidth() simply sums the same
per-character widths that wcwidth() reports:

        $ LANG=en_US.UTF-8 grapheme_test
        wcswidth len=7
        
        bytes_consumed=4, wcwidth len=2
        bytes_consumed=4, wcwidth len=2
        bytes_consumed=3, wcwidth len=0
        bytes_consumed=3, wcwidth len=1
        bytes_consumed=3, wcwidth len=0
        bytes_consumed=4, wcwidth len=2

C test program attached.  This is on Debian 11.

-- 
  Bruce Momjian  <br...@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.

#define _XOPEN_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

int
main (int argc, char *argv[])
{
	char *cp = "👩🏼‍⚕️🩺";
	wchar_t wch[100];
	int i = 0;		/* byte offset into cp; must start at zero */
	
	setlocale(LC_ALL, "en_US.UTF-8");

	mbstowcs(wch, cp, 100);
	printf("wcswidth len=%d\n\n", wcswidth(wch, 100));

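	/* Walk the string one multibyte character at a time. */
	while (cp[i])
	{
		int res = mbtowc(wch, cp + i, MB_CUR_MAX);

		/* Stop on an invalid sequence rather than looping forever. */
		if (res <= 0)
			break;

		printf("bytes_consumed=%d, ", res);

		int len = wcwidth(wch[0]);
		printf("wcwidth len=%d\n", len);
		i += res;
	}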

	return 0;
}
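
The second option from the quoted text (identifying the special Unicode
characters that create grapheme clusters and handling them ourselves) could
be sketched roughly as follows.  This is only an illustration under stated
assumptions, not code from any library: decode_utf8(), extends_cluster(),
and count_clusters() are hypothetical names, the UTF-8 decoder is minimal
and does not validate continuation bytes, and the set of cluster-extending
code points is a small hand-picked subset (zero width joiner, variation
selectors, emoji skin tone modifiers, basic combining diacritics) rather
than the full Unicode segmentation rules.  Flags (regional indicator pairs)
are not handled.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence at s into *cp.  Returns the number of bytes
 * consumed, or 0 at end of string or on an unrecognized/truncated sequence.
 * Assumption: input is otherwise well-formed; continuation bytes are not
 * validated.
 */
size_t
decode_utf8(const char *s, uint32_t *cp)
{
	const unsigned char *p = (const unsigned char *) s;

	if (p[0] == 0)
		return 0;
	if (p[0] < 0x80)
	{
		*cp = p[0];
		return 1;
	}
	if ((p[0] & 0xE0) == 0xC0 && p[1])
	{
		*cp = ((uint32_t) (p[0] & 0x1F) << 6) | (p[1] & 0x3F);
		return 2;
	}
	if ((p[0] & 0xF0) == 0xE0 && p[1] && p[2])
	{
		*cp = ((uint32_t) (p[0] & 0x0F) << 12) |
			((uint32_t) (p[1] & 0x3F) << 6) | (p[2] & 0x3F);
		return 3;
	}
	if ((p[0] & 0xF8) == 0xF0 && p[1] && p[2] && p[3])
	{
		*cp = ((uint32_t) (p[0] & 0x07) << 18) |
			((uint32_t) (p[1] & 0x3F) << 12) |
			((uint32_t) (p[2] & 0x3F) << 6) | (p[3] & 0x3F);
		return 4;
	}
	return 0;
}

/* Hand-picked subset of code points that attach to the preceding cluster. */
int
extends_cluster(uint32_t cp)
{
	return cp == 0x200D ||					/* zero width joiner */
		(cp >= 0xFE00 && cp <= 0xFE0F) ||	/* variation selectors */
		(cp >= 0x1F3FB && cp <= 0x1F3FF) ||	/* emoji skin tone modifiers */
		(cp >= 0x0300 && cp <= 0x036F);		/* combining diacritical marks */
}

/* Count approximate grapheme clusters in a NUL-terminated UTF-8 string. */
int
count_clusters(const char *s)
{
	int			clusters = 0;
	int			prev_was_zwj = 0;
	uint32_t	cp;
	size_t		n;

	while ((n = decode_utf8(s, &cp)) > 0)
	{
		/*
		 * Start a new cluster unless this code point extends the previous
		 * one or is glued to it by a preceding zero width joiner.
		 */
		if (clusters == 0 || (!extends_cluster(cp) && !prev_was_zwj))
			clusters++;
		prev_was_zwj = (cp == 0x200D);
		s += n;
	}
	return clusters;
}
```

Under these simplified rules, count_clusters() applied to the test string
from the attached program gives 2 clusters, where wcwidth() was asked about
six separate code points.  A display width could then be derived per
cluster (for example from the first code point of each), but which width to
assign is a separate question that this sketch does not answer.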
