Bo Borgerson wrote:
> Pádraig Brady wrote:
>> In the first 65535 code points there are also 404 chars which are
>> not classed as combining in the unicode database, but are classed
>> as zero width in the glibc locale data at least (zero-width space
>> being one of them like you mentioned). I determined this with the
>> attached progs:
>>
>> ./zw | python unidata.py | grep " 0 " | wc -l
>
>
> Hi Pádraig,
>
> Wow, I knew there were some stand-alone zero-width characters, but I had
> no idea there were so many!

I'm not sure whether many of those should be counted anyway.
But the combining class is all we have to go on.

>
> I poked around a little in gnulib and found a function for determining
> the combining class of a Unicode character.
>
> I think the attached patch does what you were intending to do, and it
> also counts all of the stand-alone zero-width characters you found:

Cool, thanks.
Could you optimize it though, and do the following, since
you've already calculated wcwidth()?

    if (!width && uc_combining_class(wide_char))
        chars--;

I did notice that wcwidth(0x1B44) returns 1, but I think that is
because this combining char is new in Unicode version 5.0,
and my locale tables are probably not up to date.
Search for "adeg adeg" here:
http://unicode.org/versions/Unicode5.0.0/ch11.pdf

I also notice the gnulib/uniwidth/ functions, which might be
more up to date and calculate wcwidth(0x1B44) correctly as 0?

thanks again,
Pádraig
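
For illustration, here is a minimal, self-contained sketch of the check suggested above. It is not the actual patch. The names count_chars(), demo_width() and combining_class_stub() are hypothetical; combining_class_stub() stands in for gnulib's uc_combining_class() and only knows the range U+0300..U+034E, and demo_width() stands in for wcwidth() so the result does not depend on the installed locale tables.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Signature shared by wcwidth() and the demo stand-in below.  */
typedef int (*width_fn) (wchar_t);

/* Hypothetical stand-in for gnulib's uc_combining_class().  It only
   covers U+0300..U+034E and returns a non-zero placeholder, not the
   real canonical combining class value.  */
static int
combining_class_stub (wchar_t wc)
{
  return (0x0300 <= wc && wc <= 0x034E) ? 1 : 0;
}

/* Hypothetical stand-in for wcwidth(): zero width for the combining
   diacritical marks block and for U+200B ZERO WIDTH SPACE, 1 otherwise.  */
static int
demo_width (wchar_t wc)
{
  if ((0x0300 <= wc && wc <= 0x036F) || wc == 0x200B)
    return 0;
  return 1;
}

/* Count characters in S[0..N-1], skipping those that are both zero
   width and combining.  Stand-alone zero-width characters (combining
   class 0) are still counted.  */
static size_t
count_chars (const wchar_t *s, size_t n, width_fn width_of)
{
  size_t chars = 0;

  assert (s != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    {
      int width = width_of (s[i]);

      chars++;
      /* The width is already known, so the combining-class lookup
         is only needed for zero-width characters.  */
      if (!width && combining_class_stub (s[i]))
        chars--;
    }
  return chars;
}
```

With these stand-ins, "e" followed by U+0301 counts as one character, while "a", U+200B, "b" counts as three, since the zero-width space has combining class 0. In real code one would presumably pass wcwidth itself as the width function and call uc_combining_class() in place of the stub.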