Re: horrible utf-8 performance in wc

Pádraig Brady Wed, 07 May 2008 18:06:33 -0700

Bo Borgerson wrote:
> Pádraig Brady wrote:
>> In the first 65535 code points there are also 404 chars which are
>> not classed as combining in the unicode database, but are classed
>> as zero width in the glibc locale data at least (zero-width space
>> being one of them like you mentioned). I determined this with the
>> attached progs:
>>
>> ./zw | python unidata.py | grep " 0 " | wc -l
> 
> 
> Hi Pádraig,
> 
> Wow, I knew there were some stand-alone zero-width characters, but I had
> no idea there were so many!


I'm not sure many of those should be counted anyway.
But the combining class is all we have to go on.

> 
> I poked around a little in gnulib and found a function for determining
> the combining class of a Unicode character.
> 
> I think the attached patch does what you were intending to do, and it
> also counts all of the stand-alone zero-width characters you found:

cool, thanks.
Could you optimize it though, and do the following,
since you've already calculated wcwidth()?

  if (!width && uc_combining_class(wide_char))
    chars--;
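
The check above can be sketched outside of wc as follows. This is only an
illustration in Python, not the patch itself: unicodedata.combining() stands
in for gnulib's uc_combining_class(), and the width_of parameter is a
hypothetical stand-in for wcwidth(), which the Python standard library does
not provide.

```python
import unicodedata


def default_width(ch):
    # Assumption for this sketch: pretend the locale reports width 0 for
    # every character with a nonzero combining class, and 1 otherwise.
    return 0 if unicodedata.combining(ch) else 1


def count_chars(text, width_of=default_width):
    """Count code points, skipping zero-width combining characters."""
    chars = 0
    for ch in text:
        chars += 1
        width = width_of(ch)
        # Only consult the combining class when the width is already 0,
        # mirroring the short-circuit in the C snippet.
        if not width and unicodedata.combining(ch):
            chars -= 1
    return chars
```

With these assumptions, "e" followed by U+0301 counts as one character, while
a stand-alone zero-width space (combining class 0) is still counted. Passing
a width_of that always returns 1 models out-of-date locale tables, in which
case the combining character is counted.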

I did notice that wcwidth(0x1B44) returns 1, but I think that is because
this combining char is new in Unicode version 5.0, and my locale tables
are probably not up to date. Search for "adeg adeg" here:
http://unicode.org/versions/Unicode5.0.0/ch11.pdf
I also noticed the gnulib/uniwidth/ functions, which might be more up to date
and return the correct width of 0 for 0x1B44?
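
One way to cross-check what the Unicode database itself says about a code
point, independently of the locale tables, is sketched below. It assumes
Python's bundled unicodedata module, whose data version may differ from both
glibc and gnulib; describe() is a hypothetical helper name.

```python
import unicodedata


def describe(code_point):
    """Return (name, general category, combining class) for a code point."""
    ch = chr(code_point)
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


if __name__ == "__main__":
    # Report which Unicode version the bundled tables correspond to.
    print("Unicode data version:", unicodedata.unidata_version)
    for cp in (0x1B44, 0x200B):
        print(hex(cp), describe(cp))
```

A nonzero third field means the database classes the character as combining,
whatever width the locale reports for it.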

thanks again,
Pádraig


_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
