Re: using re.sub with unicode string in response middleware

Malcolm Tredinnick Mon, 07 Jan 2008 16:55:23 -0800


On Mon, 2008-01-07 at 18:28 -0600, Gary Wilson Jr. wrote:
> Malcolm Tredinnick wrote:
> > On Sun, 2008-01-06 at 15:25 -0600, Gary Wilson Jr. wrote:
> >> It appears that at this point, response.content is a utf8-encoded 
> >> bytestring.
> >> I'm playing with a response middleware doing something like:
> >>
> >> MY_RE.sub(u'%s</body>' % text, response.content)
> >>
> >> which raises a UnicodeDecodeError if response.content contains non-ascii.
> >>
> >> I understand that the strings need to be of the same type, but was 
> >> wondering
> >> if response.content needs to be returned as a utf8-encoded bytestring or if
> >> it's ok to convert it to unicode and return that.  Does it matter?
> > 
> > It must be UTF-8 (or, at least, a bytestring). Some encoding has to be in
> > force, since "unicode" isn't a character encoding and response.content
> > is the last station before we send stuff back to the web server.
> 
> So to make sure I've got this right, would either of the two examples below be
> sufficient?
> 
> content = MY_RE.sub(u'%s</body>' % text, force_unicode(response.content))
> content = content.encode('utf-8')


Not quite sure what you're doing with "content" here, since a response
middleware modifies the response directly. Since you can happily set
"content" with a unicode object, you should just be able to do

        response.content = ....

> 
> content = MY_RE.sub((u'%s</body>' % text).encode('utf-8'), response.content)

In both cases, for absolute bullet-proofness, you could use
response._charset as the encoding (rather than assuming it's the default
of UTF-8). Obviously depends on circumstances, but if this is for
something in Django's core, for example, it needs to be flexible. Every
now and again, somebody is going to change the DEFAULT_CHARSET value.
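
A minimal sketch of the decode, substitute, re-encode round trip, written
as plain Python 3 rather than the Python 2 used in this thread. BODY_RE and
inject_before_body are hypothetical names standing in for MY_RE and the
middleware body, and the charset argument plays the role of
response._charset; none of this is Django's own code.

```python
import re

# Hypothetical stand-in for MY_RE from the thread.
BODY_RE = re.compile(r'</body>', re.IGNORECASE)


def inject_before_body(content, text, charset='utf-8'):
    """Insert `text` before the first </body> in an encoded byte string.

    `content` is bytes, as response.content would be; `charset` is assumed
    to be whatever the response was encoded with.
    """
    # Decode first so the pattern, replacement and subject are all text.
    decoded = content.decode(charset)
    # A callable replacement avoids backslash escapes in `text` being
    # interpreted by re.sub.
    replaced = BODY_RE.sub(lambda m: text + m.group(0), decoded, count=1)
    # Re-encode with the same charset before handing the bytes back.
    return replaced.encode(charset)
```

Passing the charset explicitly, instead of hard-coding 'utf-8', is what
keeps this working if somebody changes DEFAULT_CHARSET.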

(There is, by the way, a subtle semi-bug hidden in there: if you pass in
a mime type, including an encoding, we still (re-)encode the data, which
is a little naughty. It's difficult to work out all the cases when we
should and shouldn't, though. Again, lots of "we could do..."
possibilities, but each one has trade-offs. That's a way-out-there
edge-case, though.)

> 
> > I realise this is slightly inconvenient for middleware classes, but
> > since we cannot tell ahead of time if any middleware classes are going
> > to be invoked, we have to treat response.content specially.
> 
> Could the handler not do the final encoding as the last thing it does on the
> response's way out (so after any middleware has been processed)?

Naturally, anything is possible, but I don't like the design.
HttpResponse returns a valid HTTP response via its __str__ method and
valid HTTP data via the "content" attribute. That's a nicely
encapsulated design. Let's resist messing with it and keep the
responsibility in the right place.

If you really want to avoid the whole extra dozen characters of typing
now and again, let's add a unicode_content property to HttpResponse.
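
Roughly what such a property could look like, as a sketch only.
SketchResponse is a hypothetical stand-in class written in Python 3, not
HttpResponse itself; the only names taken from this thread are content,
_charset and unicode_content.

```python
class SketchResponse:
    """Hypothetical stand-in for a response object; not Django's class."""

    def __init__(self, content=b'', charset='utf-8'):
        self._charset = charset
        self.content = content  # goes through the setter below

    @property
    def content(self):
        # Always an encoded byte string, ready to send to the web server.
        return self._content

    @content.setter
    def content(self, value):
        # Accept text as well as bytes, encoding text with the charset.
        if isinstance(value, str):
            value = value.encode(self._charset)
        self._content = value

    @property
    def unicode_content(self):
        # Decoded view of the same data, convenient for middleware.
        return self._content.decode(self._charset)

    @unicode_content.setter
    def unicode_content(self, value):
        self._content = value.encode(self._charset)
```

With something like this, a middleware could read
response.unicode_content, run its substitution on text, and assign the
result back, while content keeps returning encoded bytes.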

Regards,
Malcolm

-- 
Experience is something you don't get until just after you need it. 
http://www.pointy-stick.com/blog/

