Re: python_2_unicode_compatible pitfalls

Mikhail Korobov Fri, 28 Dec 2012 01:07:37 -0800

Hi Aymeric,

FYI: I didn't check anything in my message manually, this is a "code 
review" based on grepping over django source code for common problems.


пятница, 28 декабря 2012 г., 6:15:06 UTC+6 пользователь Aymeric Augustin 
написал:
>
> Hi Mikhail,
>
> Thanks for your feedback! Your email touches many topics; can we try
> to extract the most important ones and identify what really needs fixing
> before the 1.5 release?
>
> 1) @python_2_unicode_compatible doesn't handle __repr__.
>
>
> Indeed, this decorator isn't designed to handle __repr__ -- I haven't
> even tried.
>
> At first sight, I'd say this isn't a sufficient reason to delay the 1.5 
> release,
> but it's a feature we should consider for 1.6.
>
>    For example, this affects django.db.models.options.Options,
>    django.core.files.base.File (and ContentFile),
>    django.contrib.admin.models.LogEntry, django.template.base.Variable
>    and probably many others (their __repr__ incorrectly returns unicode).
>
>
>  

> Is it a regression in 1.5? Or did the problem already exist?
>
>
This is a regression. It is not very serious (the unicode would contain 
only ascii characters in all the cases above), but it may have some subtle 
consequences, and it seems that these methods was written as returning 
unicode intentionally (e.g. LogEntry.__repr__ 
returns smart_text(self.action_time), not str(self.action_time)). 

Oh, and my late night review of Model.__repr__ was not correct 
:) six.text_type(self) has nothing to do with encoding/decoding, it calls 
__unicode__ method and __repr__ returns utf8-encoded text as it supposed.

 
 

> 2) under Python 2.x __str__ is implemented as __unicode__
>    encoded to utf8.
>
>
> Yes, this is a legacy behavior that I strongly disagree with. It
> makes __str__ / __unicode__ handling for Model subclasses
> exceedingly sketchy.* The only reason why I didn't remove
> it is backwards compatibility.
>
> * I expect future maintainers, including myself, to hate this:
> https://github.com/django/django/commit/2ea80b94
>
> This breaks 'print django_obj' when sys.stdout.encoding
>    is not utf8 because print uses __str__ (not __unicode__) for custom 
> objects,
>    and the terminal expects the result to be encoded in sys.stdout.encoding
>    (print encodes unicode strings to sys.stdout.encoding, but doesn't
>    use __unicode__ of objects; this is hard-coded in Python 2.x). 
>    This may affect REPL in Windows consoles and printing/writing to stdout 
>    in management commands.
>
>
> Django often defaults to utf-8 rather than try to figure out the
> most appropriate charset. Such code generally dates back
> to the unicode branch. So, yeah, this problem exists but it's
> not really new.
>
>
Defaulting to utf8 is better than trying to figure out the most appropriate 
charset, I fully agree with that (but prefer 7bit ascii myself which is not 
guessing). However, this makes REPL and printing not usable in environments 
where sys.stdout.encoding is not utf8. User usually expects to be able to 
type an object in console to check it, and this won't work (it may raise an 
exception or produce mangled output) under Windows (and some older *nixes) 
when there is a non-ascii data. Isn't it a bug? If it is not a bug, and it 
is an intended behavior that wouldn't be fixed (is backwards compatibility 
the only issue?), then maybe it should be documented ("sorry, we don't 
support cmd.exe under Python 2.x") to save developers time? Or maybe there 
are workarounds that would fix this? Currently REPL may fail because of 
django behavior, not because of buggy evil Windows or Python 2.x bugs, so 
IMHO it is django responsibility to either fix it or tell users what to do. 

 

>
> 3) @python_2_unicode_compatible produces incorrect results 
>    when applied twice (__str__ is patched by previous decorator 
> application 
>    and returns bytestring because of that).
>    This is easy to oversight e.g. when applying this decorator to a
>    subclass of a class which is wrapped to @python_2_unicode_compatible
>    and deleting the overridden __str__ afterwards.
>
>
> This is a good point. I looked at the `_was_fixed` technique you're
> proposing below. It does work in more cases, but I'm pretty sure
> you can still break it by overriding __str__ or __unicode__ in a
> sufficiently complex class hierarchy.
>
>
>
__unicode__ becomes pretty internal with @python_2_unicode_compatible, and 
overriding it may break things. I was about to introduce checking for 
overridden __unicode__ (by something like adding _nltk_is_generated 
attribute), but decided not to be too smart (there may be reasons to 
override __unicode__ I don't know about).

On the other hand, overriding __str__ is fully legit. Can you please create 
an example on how can __str__ be broken in complex class hierarchy with 
_was_fixed?

4) __str__ is not always properly implemented for this decorator in django
>    code. To work properly with @python_2_unicode_compatible,
>    __str__ must return unicode string. This is quite subtle.
>
>
> This is destabilizing for Python 2 developers, but once you get it,
> you get the whole "write Python 3 code that also works under
> Python 2" philosophy, and it's documented. I stand by this decision.
>
>
I mean that sometimes __str__ is not properly implemented in django code 
itself :)  @python_2_unicode_compatible + __str__ returning unicode is a 
good decision in my view. Check all __str__ methods in files where 
unicode_literals is not in effect (django.contrib.comments.Comment.__str__ 
and CommentFlag has the same subtle, almost theoretical issue which won't 
occur often because self.name and self.comment would be unicode strings 
most of time). 

>    For example, take a look at django.contrib.gis.maps.google.GEvent.
>    __str__ is implemented as 
>    "return mark_safe('"%s", %s' %(self.event, self.action))",
>    but "from __future__ import unicode_literals" is not applied to the 
> file.
>
>
> django.contrib.gis is a large piece of code that hasn't received as
> much attention as core. There's certainly a bunch of bugs like this one.
>
>    This means that if event and action are Python objects with both __str__
>    and __unicode__ methods defined (e.g. object of class wrapped with
>    python_2_unicode_compatible) then __str__ would be called for these 
> objects,
>    not __unicode__ (because the format string is a bytestring). Generally,
>    "%s" % something is a good and correct pattern for __str__ 
> implementation
>    (it does the right thing under both Python 2.x and 3.x when
>    unicode_literals future import is there), but it is incorrect under 
> Python
>    2.x if unicode_literals is not imported.
>
>
> I don't understand. Unless you're using @python_2_unicode_compatible,
> __str__ must return a string. `"%s" % foo` works in Python 2 and 3 *unless*
> you enable unicode_literals; if you do, the right pattern is `str("%s") % 
> foo`.
>

I mean "correct pattern for __str__" for classes that 
use @python_2_unicode_compatible.
 

>
> 5) %r is very tricky. If unicode_literals are in effect, or some 
>    arguments for string formatting are unicode,
>    "%r" % obj would trigger bytes decoding using sys.getdefaultencoding() 
> under
>    Python 2.x (unless obj is an unicode string), and if obj.__repr__ 
> returns
>    non-ascii text or obj is a bytestring, exception would be raised
>    (because sys.getdefaultencoding() is usually ascii).
>    This format specifier is used, for example, in a default_error_messages
>    for django.db.models.fields.Field; after switching to unicode_literals
>    this may start raising UnicodeDecodeExceptions for non-ascii choices
>    if they are custom objects (not unicode strings).
>    Another example is 
> django.http.response.HttpResponseBase._convert_to_charset
>    where BadHeaderError exception is raised: after switching to 
> unicode_literals
>    %r format specifier start triggering decoding of "value" using 
> sys.getdefaultencoding()
>    which is incorrect because "value" is a bytestring of 'charset' 
> encoding under
>    Python 2.x. Another example is django.utils.datastructures.SortedDict:
>    its __repr__ uses '%r: %r' % (k, v) for k, v in six.iteritems(self)
>    which may fail if key is an unicode string and a value is a bytestring
>    or an object with __repr__ returning non-ascii text. Another example
>    is django.utils.encoding.DjangoUnicodeDecodeError
>    (it has incorrect __str__ by the way because it returns unicode) -
>    it uses "%r" for self.obj, with unicode string formatter,
>    and this would blow up if __repr__ of obj returns non-ascii text.
>    There are other places where %r is used and they all are fragile.
>
>
> If I understand correctly, you're saying there's a regression in
> modules that define __repr__ methods and where unicode_literals
> were enabled?
>

No, the regression is in modules that use "%r" format specifier and the 
string containing this specifier becomes unicode. "byte string %r" % obj 
calls obj.__repr__() and uses the result directly, while u"unicode_string 
%r" % obj calls obj.__repr__() and decodes the result using 
sys.getdefaultencoding().
 

>
> I've implemented an another python_2_unicode_compatible decorator 
> (inspired by django's, the idea is cool) for NLTK: 
> https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which 
> resolves some of issues above (it handles __repr__, limits __str__ and 
> __repr__ to ascii and supports subclassing better). The article (rather 
> lengthy, with some django bashing :) that provides motivation for the 
> decorator used in NLTK: http://kmike.ru/python-with-strings-attached/(the 
> code in the article is a bit outdated, it is not the code used in 
> NLTK; NLTK version was improved, but I didn't update the article yet).
>
>
> Obviously your version is more elaborate and handles more cases.
> The concept of `_was_fixed` is interesting but as explained above
> I'm not sure we can make it fully reliable. I'm also concerned about
> the transliteration / 7 bit stuff -- it smell like something people are
> going to complain about or bikeshed forever.
>


> The real question here is -- how much complexity is a core dev willing
> to introduce into Django, and defend in front of tens of thousands of
> developers for the next few years? As far as I'm concerned, the answer
> is "very little", because I'm not very smart and my days only have so
> many hours.
>
>
I agree that this all is too elaborate (awfully elaborate). But I don't 
know how to fix REPL issues in another way. The alternative (as far as I 
can tell) is not fixing an real issue in order to simplify the library 
implementation. This may be a legit approach if users of library are aware 
of the limitation though.
 
 

> To sum up, I think most of your remarks are valid. But taken all together
> they aren't easily actionable. You're mixing existing behaviors that we
> must keep for backwards-compatibility (even if they're debatable) and
> new behaviors that we still have a chance to fix before 1.5 sets them in
> stone. And you're requesting a massive lot of changes.
>
> To move forward, I'd suggest to divide and classify all this into smaller
> sets of problems, and describe them precisely:
> - regressions introduced by Python 3 support (release blockers),
> - limitations of the current Python 3 support,
> - str / repr method that don't return the correct type,
> - legacy design decisions that appear suboptimal,
> - etc.
>
> Thanks,
>
> -- 
> Aymeric.
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/django-developers/-/FBSijPwhjt4J.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.

Re: python_2_unicode_compatible pitfalls

Reply via email to