[re-adding the list]

On 06/15/2011 03:28 PM, Al Bogner wrote:
>> When all of the bytes are ignored as non-printable, then all three
>> lines are identical, hence -u prints only one line.
>
> Ok and thanks. I had a different understanding of non-printable.

"Non-printable" here means that isprint(3) returns 0 for a given byte (in a single-byte locale such as C), or that iswprint(3) returns 0 for a given wide character (a Unicode character composed from UTF-8 bytes, in a multi-byte locale such as de_DE.UTF-8). Both functions are locale-specific: a byte value may be deemed printable in one locale but not in another.

Furthermore, isprint(0xa0) and iswprint(0xa0) may give different results within the same locale, if the implementation is trying to reject incomplete UTF-8 sequences and only understands complete wchar_t values as characters. In that case, any code that calls isprint() on the individual bytes of a UTF-8 sequence, rather than iswprint() on the wchar_t of each composed Unicode character, gets the (unfortunate) result that no multi-byte character is recognized as printable.

Factor into this mess the fact that upstream coreutils still lacks decent multi-byte handling in a lot of utilities. Various distros carry add-on patches for better wchar_t handling, but as of yet they have not been consolidated into something that is easily maintainable and adds no overhead to the single-byte C locale situation.

-- 
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
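
As a rough illustration of the byte-versus-character distinction described above, here is a minimal C sketch. It is not taken from coreutils; the helper names count_printable_bytes, count_printable_wide and print_report are made up for this example. The first helper asks isprint() about each byte in isolation, the second decodes the string with mbrtowc() and asks iswprint() about each resulting wide character.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the bytes that isprint() accepts,
 * looking at each byte on its own. */
size_t count_printable_bytes(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* isprint() needs an unsigned char value (or EOF). */
        if (isprint((unsigned char)*s))
            n++;
    }
    return n;
}

/* Hypothetical helper: decode the string in the current locale with
 * mbrtowc() and count the wide characters that iswprint() accepts. */
size_t count_printable_wide(const char *s)
{
    assert(s != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t n = 0;
    size_t left = strlen(s);
    while (left > 0) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, left, &state);
        if (len == (size_t)-1 || len == (size_t)-2) {
            /* Invalid or incomplete sequence: skip one byte and
             * start over with a fresh conversion state. */
            memset(&state, 0, sizeof state);
            s++;
            left--;
            continue;
        }
        if (len == 0)
            break; /* decoded a NUL; should not happen before the end */
        if (iswprint((wint_t)wc))
            n++;
        s += len;
        left -= len;
    }
    return n;
}

/* Hypothetical helper: switch to the named locale and print both counts. */
void print_report(const char *locale_name, const char *text)
{
    if (setlocale(LC_ALL, locale_name) == NULL) {
        printf("%s: locale not available\n", locale_name);
        return;
    }
    printf("%s: isprint() accepts %zu byte(s), iswprint() accepts %zu character(s)\n",
           locale_name, count_printable_bytes(text), count_printable_wide(text));
}
```

Calling print_report("C", "\xc3\xa4") and print_report("de_DE.UTF-8", "\xc3\xa4") from a main() shows whether the two counts differ on a given system. The results depend on the libc and on which locales are installed, so treat any particular output as specific to that machine.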