[issue9198] Should repr() print unicode characters outside the BMP?

Marc-Andre Lemburg Thu, 08 Jul 2010 02:35:01 -0700

Marc-Andre Lemburg <[email protected]> added the comment:

[Adding some bits from the discussion on #5127 for better context]

"""
Ezio Melotti wrote:
> >
> > Ezio Melotti <[email protected]> added the comment:
> >
> > [This should probably be discussed on python-dev or in another issue, so
> > feel free to move the conversation there.]
> >
> > The current implementation considers printable """all the characters except
> > those characters defined in the Unicode character database as following
> > categories are considered printable.
> >   * Cc (Other, Control)
> >   * Cf (Other, Format)
> >   * Cs (Other, Surrogate)
> >   * Co (Other, Private Use)
> >   * Cn (Other, Not Assigned)
> >   * Zl Separator, Line ('\u2028', LINE SEPARATOR)
> >   * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
> >   * Zs (Separator, Space) other than ASCII space('\x20')."""
> >
> > We could also arbitrarily exclude all the non-BMP chars, but that shouldn't
> > be based on the availability of the fonts IMHO.

Without fonts, you can't print the code points, even if the Unicode
database defines the code point as not having one of the above
classes. And that's probably also the reason why the Unicode
database doesn't define a printable property :-)

I also find the use of Zl, Zp and Zs in the definition somewhat
arbitrary: whitespace is certainly printable. This also doesn't
match the isprint() C lib API:

http://www.cplusplus.com/reference/clibrary/cctype/isprint/

"A printable character is any character that is not a control character."
"""
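
For reference, the two definitions discussed in the quoted exchange can be
compared with a small sketch. The helper names below are hypothetical and only
illustrate the idea: the first mirrors the quoted category list, the second is
one possible reading of the C lib rule ("not a control character") applied to
Unicode via the Cc category.

```python
import unicodedata

# Categories the quoted definition treats as non-printable.
NONPRINTABLE_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}


def is_printable_by_category(ch):
    """Hypothetical helper mirroring the quoted category list."""
    if ch == " ":
        return True  # ASCII space is the single Zs exception
    return unicodedata.category(ch) not in NONPRINTABLE_CATEGORIES


def is_printable_c_style(ch):
    """Hypothetical helper: anything that is not a control character (Cc)."""
    return unicodedata.category(ch) != "Cc"


# NO-BREAK SPACE (Zs) is where the two definitions disagree.
print(is_printable_by_category("\u00a0"), is_printable_c_style("\u00a0"))
```

On Python 3, str.isprintable() should agree with the first helper for the
characters tried here; the second helper additionally accepts whitespace and
separators.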

There are two aspects:

 * What to call a printable code point?

   I'd suggest following the C lib approach: all non-control
   characters.

 * Which criteria to use for Unicode repr()?

   Given the original intent of the extension to allow printable
   code points to pass through unescaped, it may be better to
   define "printable" based on the sys.stdout/sys.stderr encoding:

   A code point may pass through unescaped if it is
   printable per the above definition and does not cause problems
   with the sys.stdout/sys.stderr encoding.

   Since we can't apply this check on a per-character basis,
   I think we should only allow non-ASCII code points to pass through
   if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.
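
A minimal sketch of the rule proposed above, assuming "printable" means "not
in category Cc" and that the stream encoding is checked once rather than per
character. The names stream_is_unicode and sketch_repr are hypothetical; this
is not how repr() is implemented, and the usual short escapes such as \n are
left out for brevity.

```python
import codecs
import unicodedata

# Encodings under which non-ASCII code points would pass through unescaped.
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}


def stream_is_unicode(stream):
    """Return True if the stream's encoding is utf-8, utf-16 or utf-32."""
    encoding = getattr(stream, "encoding", None)
    if not encoding:
        return False
    try:
        # codecs.lookup() normalizes aliases such as "UTF8" to "utf-8".
        return codecs.lookup(encoding).name in UNICODE_ENCODINGS
    except LookupError:
        return False


def sketch_repr(text, unicode_ok):
    """Quote text, escaping whatever the proposed rule would not pass through."""
    out = []
    for ch in text:
        cp = ord(ch)
        printable = unicodedata.category(ch) != "Cc"
        if ch in ("'", "\\"):
            out.append("\\" + ch)
        elif printable and (cp < 128 or unicode_ok):
            out.append(ch)
        elif cp < 0x100:
            out.append("\\x%02x" % cp)
        elif cp < 0x10000:
            out.append("\\u%04x" % cp)
        else:
            out.append("\\U%08x" % cp)
    return "'" + "".join(out) + "'"
```

A caller would compute the flag once, for example
sketch_repr(s, stream_is_unicode(sys.stdout)) after importing sys, so that a
non-BMP character is escaped as \UXXXXXXXX whenever the output encoding is not
one of the three listed.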

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue9198>
_______________________________________