Re: using re.sub with unicode string in response middleware

Malcolm Tredinnick Tue, 08 Jan 2008 18:49:09 -0800

Hey Gary,

On Tue, 2008-01-08 at 00:35 -0600, Gary Wilson Jr. wrote:
[...]
> So, looking at a couple places in Django trunk where response.content is used,
> these look like bugs:
> 
> 
> django.contrib.csrf.middleware.CsrfMiddleware.process_response:
> 
> def process_response(self, request, response):
>     ...
>     response.content = _POST_FORM_RE.sub(add_csrf_field, response.content)
>     ...


This isn't a bug, but it's subtle. There is only a problem if you are
trying to substitute a Python unicode object into a bytestring. That's
because Python tries to coerce the two elements into the same type
(unicode in this case) and it uses the "ascii" codec by default. If you
try to substitute a bytestring into a bytestring, no problems.

The example you started the thread with was the former case: you were
using a u'...' string as the first argument and a bytestring
(response.content) as the second argument. The CSRF middleware is using
bytestrings throughout, so it's safe.
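
A minimal sketch of the two cases, written so that it also runs on
Python 3, where the implicit coercion has to be spelled out by hand.
FORM_RE and content are hypothetical stand-ins, not the actual
_POST_FORM_RE or a real response body.

```python
import re

# Hypothetical stand-ins for the middleware's pattern and the response body.
FORM_RE = re.compile(b"<form>")
content = u"<form>caf\u00e9</form>".encode("utf-8")  # non-ASCII bytes inside

# Bytestring into bytestring: nothing needs coercing, so this is safe.
safe = FORM_RE.sub(b"<form><input>", content)

# Unicode into bytestring: the implicit coercion described above amounts to
# content.decode("ascii"), which fails on the first non-ASCII byte.
try:
    content.decode("ascii")
    coercion_failed = False
except UnicodeDecodeError:
    coercion_failed = True
```

If content happened to be pure ASCII, the decode would succeed, which is
why this kind of problem tends to show up only with real-world data.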

> django.test.testcases.TestCase.assertContains:
> 
> def assertContains(self, response, text, count=None, status_code=200):
>     ...
>     real_count = response.content.count(text)
>     ...

Yes, this is a semi-bug. The "correct" way to use it is non-obvious: you
need to make sure 'text' is a bytestring -- so it's possible to use it
correctly, but the obvious way is sometimes wrong, which makes it a bad
API.

When this popped up previously, it was because "text" was a unicode
object and response.content wasn't, so Python tried to coerce the latter
to a Python unicode object and failed dismally. This is an argument in
favour of adding a unicode_content attribute to HttpResponse.
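
Until something like that exists, the caller-side workaround is to
encode "text" before counting. A rough sketch follows; count_in_content
and its charset default are hypothetical names, not Django API, and this
is not the unicode_content attribute itself.

```python
def count_in_content(content, text, charset="utf-8"):
    """Count occurrences of text in a bytestring body (hypothetical helper)."""
    if not isinstance(text, bytes):
        # Encode unicode text explicitly rather than letting Python try to
        # coerce the bytestring body with the "ascii" codec.
        text = text.encode(charset)
    return content.count(text)


# Stand-in for response.content: a UTF-8 encoded bytestring.
body = u"<p>caf\u00e9 caf\u00e9</p>".encode("utf-8")
real_count = count_in_content(body, u"caf\u00e9")
```

This only gives the right answer if charset matches the encoding the
response was actually rendered with.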

If you want to be a really good maintainer here and really give encoding
in responses a workover, these are the things I would think about:

        - if somebody specifies a mimetype with a content encoding, we
        should use that for the encoding (not re-encode to UTF-8).
        
        - if the mimetype isn't something that can be sensibly
        re-encoded, don't. For example, image/jpeg shouldn't go through
        the re-encoding washing machine.
        
The problem is that this is all very difficult to get correct without
dozens and dozens of special cases. I suspect the right solution is that
if a mimetype is specified and a bytestring is passed in, we should
*never* re-encode the information. If a unicode object is passed in, we
can encode it according to the charset specified (or self._charset
otherwise). Think about it a bit and see if that makes sense to you.
This is fairly brain-twisting stuff, but there should be a simple
solution where we don't try to second-guess the user. Basically, I'd
like images and other binary opaque data not to be accidentally munged
by middleware (there's a ticket open about respecting the
content-transfer header, for example, that's related to this, too).
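
The rule proposed above could look roughly like this. encode_body is a
hypothetical helper and default_charset stands in for self._charset. The
sketch is simplified in that it treats every bytestring as opaque,
whether or not a mimetype was given.

```python
def encode_body(content, charset=None, default_charset="utf-8"):
    """Hypothetical helper: bytes pass through, unicode gets encoded."""
    if isinstance(content, bytes):
        # Already a bytestring (for example image/jpeg data): never re-encode.
        return content
    # Unicode object: use the caller's charset if one was specified,
    # otherwise fall back to the default (standing in for self._charset).
    return content.encode(charset or default_charset)


jpeg_bytes = b"\xff\xd8\xff\xe0"  # opaque binary data, returned untouched
page = encode_body(u"caf\u00e9", charset="iso-8859-1")
```

The point of the design is that the type of the object passed in decides
what happens, so nothing has to guess whether a given mimetype is safe
to re-encode.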

Cheers,
Malcolm

-- 
The early bird may get the worm, but the second mouse gets the cheese. 
http://www.pointy-stick.com/blog/

