Marc-Andre Lemburg <m...@egenix.com> added the comment:

[Adding some bits from the discussion on #5127 for better context]

""" Ezio Melotti wrote: > > > > Ezio Melotti <ezio.melo...@gmail.com> added the comment: > > > > [This should probably be discussed on python-dev or in another issue, so > > feel free to move the conversation there.] > > > > The current implementation considers printable """all the characters except > > those characters defined in the Unicode character database as following categories are considered printable. > > * Cc (Other, Control) > > * Cf (Other, Format) > > * Cs (Other, Surrogate) > > * Co (Other, Private Use) > > * Cn (Other, Not Assigned) > > * Zl Separator, Line ('\u2028', LINE SEPARATOR) > > * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR) > > * Zs (Separator, Space) other than ASCII space('\x20').""" > > > > We could also arbitrary exclude all the non-BMP chars, but that shouldn't > > be based on the availability of the fonts IMHO. Without fonts, you can't print the code points, even if the Unicode database defines the code point as not having one of the above classes. And that's probably also the reason why the Unicode database doesn't define a printable property :-) I also find the use of Zl, Zp and Zs in the definition somewhat arbitrary: whitespace is certainly printable. This also doesn't match the isprint() C lib API: http://www.cplusplus.com/reference/clibrary/cctype/isprint/ "A printable character is any character that is not a control character." """ There are two aspects: * What to call a printable code point ? I'd suggest to follow the C lib approach: all non-control characters. * Which criteria to use for Unicode repr() ? Given the original intent of the extension to allow printable code points to pass through unescaped, it may be better to define "printable" based on the sys.stdout/sys.stderr encoding: A code points may pass through unescaped, if it is printable per the above definition, and does not cause problems with the sys.stdout/sys.stderr encoding. 
   Since we can't apply this check on a per-character basis, I think we
   should only allow non-ASCII code points to pass through if
   sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9198>
_______________________________________
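
For illustration only, the rule proposed in this comment could be sketched in Python roughly as follows. The helper names `is_printable_clib_style` and `may_pass_unescaped` are hypothetical and not part of CPython. The sketch assumes that "control character" means Unicode general category Cc, and that the caller supplies the stream's encoding name (for example `sys.stdout.encoding`).

```python
import codecs
import unicodedata

# Encodings the comment proposes as safe for non-ASCII pass-through.
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}


def is_printable_clib_style(ch):
    """Hypothetical helper: 'printable' means 'not a control character'.

    Assumption: control characters are exactly those in category Cc.
    """
    return unicodedata.category(ch) != "Cc"


def may_pass_unescaped(ch, encoding):
    """Hypothetical helper: would repr() leave this code point unescaped?"""
    if not is_printable_clib_style(ch):
        return False
    if ord(ch) < 128:
        # ASCII code points pass through regardless of the encoding.
        return True
    # codecs.lookup() normalizes aliases such as "UTF8" to "utf-8".
    return codecs.lookup(encoding).name in UNICODE_ENCODINGS


if __name__ == "__main__":
    for ch in ("a", "\n", "\u00a0", "\u00e9"):
        print(ascii(ch),
              "clib-style:", is_printable_clib_style(ch),
              "str.isprintable():", ch.isprintable(),
              "latin-1:", may_pass_unescaped(ch, "latin-1"),
              "utf-8:", may_pass_unescaped(ch, "utf-8"))
```

Under this sketch U+00A0 (NO-BREAK SPACE, category Zs) counts as printable, whereas `str.isprintable()` rejects it, which is the difference between the two definitions discussed above.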