Tom Christiansen <tchr...@perl.com> added the comment:

> Martin v. Löwis <mar...@v.loewis.de> added the comment:

> I think the WideCharToMultibyte approach is just incorrect.

> I'm -1 on using wcswidth, though.

Like you, I too seriously question using wcswidth() for this at all:

    The wcswidth() function either shall return 0 (if pwcs points to a
    null wide-character code), or return the number of column positions
    to be occupied by the wide-character string pointed to by pwcs, or
    return -1 (if any of the first n wide-character codes in the
    wide-character string pointed to by pwcs is not a printable
    wide-character code).

I would be willing to bet (a small amount of) money it does not
correctly implement Unicode print widths, even though one would
certainly *think* it does according to this:

    The wcswidth() function determines the number of column positions
    required for the first n characters of pwcs, or until a null wide
    character (L'\0') is encountered.

There are a bunch of "interesting" cases I would want it tested against.

> We already have unicodedata.east_asian_width, which implements
> http://unicode.org/reports/tr11/
> The outcomes of this function are these:
> - F: full-width, width 2, compatibility character for a narrow char
> - H: half-width, width 1, compatibility character for a narrow char
> - W: wide, width 2
> - Na: narrow, width 1
> - A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
> - N: neutral; not used in Asian text, so has no width. Practically, width can
>   be considered as 1

Um, East_Asian_Width=Ambiguous (EA=A) isn't actually good enough for this.
And EA=N cannot be considered 1, either.

For example, some of the Marks are EA=A and some are EA=N, yet how many
print columns they take varies. It is usually 0, but can be 1 at the
start of the file/string or immediately after a linebreak sequence.
Then there are things like the variation selectors, which never take
up any columns at all.
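
To make the point about Marks concrete, here is a minimal sketch that
uses only the standard unicodedata module. The helper names naive_width()
and describe() are hypothetical, and the width table simply follows the
quoted summary above (EA=A taken as 1 for a non-Asian context, EA=N
taken as 1). It is an illustration of where an East_Asian_Width-only
count goes astray, not a proposed implementation:

```python
import unicodedata

# Width table taken from the quoted summary of UAX #11; "A" is treated as
# narrow (non-Asian context) and "N" as 1, exactly as the quote suggests.
EAW_TO_COLUMNS = {"F": 2, "W": 2, "H": 1, "Na": 1, "A": 1, "N": 1}


def naive_width(text):
    """Sum per-code-point widths using only East_Asian_Width (hypothetical helper)."""
    return sum(EAW_TO_COLUMNS[unicodedata.east_asian_width(ch)] for ch in text)


def describe(ch):
    """Return (code point, East_Asian_Width, General_Category) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.east_asian_width(ch),
        unicodedata.category(ch),
    )


if __name__ == "__main__":
    # COMBINING ACUTE ACCENT, HEBREW ACCENT ETNAHTA, VARIATION SELECTOR-16,
    # and a CJK ideograph for contrast.
    for ch in ("\u0301", "\u0591", "\uFE0F", "\u4E2D"):
        print(describe(ch))
    # "e" followed by a combining acute accent normally renders in one
    # column, but the East_Asian_Width-only count charges one for the mark.
    print(naive_width("e\u0301"))
```

The three nonspacing marks come back with different East_Asian_Width
values even though none of them normally takes a column of its own, so
the property by itself cannot tell you the print width.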

Now consider the many \pC code points, like

    U+0009 CHARACTER TABULATION
    U+00AD SOFT HYPHEN
    U+200C ZERO WIDTH NON-JOINER
    U+FEFF ZERO WIDTH NO-BREAK SPACE
    U+2062 INVISIBLE TIMES

A TAB is its own problem, but SHY we know is only width=1 immediately
before a linebreak or EOF, and ZWNJ and ZWNBSP are both certainly
width=0. So are the INVISIBLE * code points.

Context:

Imagine you're trying to format a string so that it takes up exactly
20 columns: you need to know how many spaces to pad it with based on
the print width. That is what #12568 needs to do, and for that you
have to look at much more than East Asian Width properties.

I really do think that what #12568 is asking for is to have the
equivalent of the Perl Unicode::GCString's columns() method, and that
you aren't going to be able to handle text alignment of Unicode with
anything that is much less than that. After all, #12568's title is
"Add functions to get the width in columns of a character".

I would very much like to compare what columns() thinks with what
wcswidth() thinks. I bet wcswidth() is very simple-minded at best.
I may of course be wrong.

--tom

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12568>
_______________________________________