Re: using re.sub with unicode string in response middleware
On Tue, 2008-01-08 at 22:45 -0600, Gary Wilson Jr. wrote: > Malcolm Tredinnick wrote: > > Hey Gary, > > > > On Tue, 2008-01-08 at 00:35 -0600, Gary Wilson Jr. wrote: > > [...] > >> So, looking at a couple places in Django trunk where response.content is > >> used, > >> these look like bugs: > >> > >> > >> django.contrib.csrf.middleware.CsrfMiddleware.process_response: > >> > >> def process_response(self, request, response): > >> ... > >> response.content = _POST_FORM_RE.sub(add_csrf_field, response.content) > >> ... > > > > This isn't a bug, but it's subtle. There is only a problem if you are > > trying to substitute a Python unicode object into a bytestring. That's > > because Python tries to coerce the two elements into the same type > > (unicode in this case) and it uses the "ascii" codec by default. If you > > try to substitute a bytestring into a bytestring, no problems. > > I was thinking that two differently-encoded bytestrings could cause problems, > unless... Are _all_ the valid HTTP character encodings supersets of ascii? I'm not saying it makes sense to do this, just that it works without raising a Python error because no type coercion takes place. Doing substitutions on raw bytes is legal. Not all encodings are supersets of ASCII, since the question doesn't make sense in many situations. What's the character set encoding of a PNG image? :-) It might even be that some text encodings fail to meet that requirement, but they clearly aren't that common and it's only going to be a requirement for legacy systems (in which case, most middleware is unlikely to be useful: if a legacy system can't even understand UTF-8, understanding things like E-tags and gzip encodings will be beyond it). We're making some not-unreasonable simplifying assumptions and making things completely subclassable for people who want to go further. As the need arises to incorporate stuff in core, we can do it. Because we need to, not just because we can. At the moment, this hasn't really been a huge issue and it looks solvable with one attribute being added. Cheers, Malcolm -- Why can't you be a non-conformist like everyone else? http://www.pointy-stick.com/blog/ --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-users?hl=en -~--~~~~--~~--~--~---
Re: using re.sub with unicode string in response middleware
Malcolm Tredinnick wrote: > Hey Gary, > > On Tue, 2008-01-08 at 00:35 -0600, Gary Wilson Jr. wrote: > [...] >> So, looking at a couple places in Django trunk where response.content is >> used, >> these look like bugs: >> >> >> django.contrib.csrf.middleware.CsrfMiddleware.process_response: >> >> def process_response(self, request, response): >> ... >> response.content = _POST_FORM_RE.sub(add_csrf_field, response.content) >> ... > > This isn't a bug, but it's subtle. There is only a problem if you are > trying to substitute a Python unicode object into a bytestring. That's > because Python tries to coerce the two elements into the same type > (unicode in this case) and it uses the "ascii" codec by default. If you > try to substitute a bytestring into a bytestring, no problems. I was thinking that two differently-encoded bytestrings could cause problems, unless... Are _all_ the valid HTTP character encodings supersets of ascii? I guess rot13 doesn't count :) >>> re.compile('sha').sub('not fun', u'fun'.encode('rot13')) 'not fun' > If you want to be a really good maintainer here and really give encoding > in responses a workover, these are the things I would think about: I'll take a look into this when I get some spare time. Gary --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-users?hl=en -~--~~~~--~~--~--~---
Re: using re.sub with unicode string in response middleware
Hey Gary, On Tue, 2008-01-08 at 00:35 -0600, Gary Wilson Jr. wrote: [...] > So, looking at a couple places in Django trunk where response.content is used, > these look like bugs: > > > django.contrib.csrf.middleware.CsrfMiddleware.process_response: > > def process_response(self, request, response): > ... > response.content = _POST_FORM_RE.sub(add_csrf_field, response.content) > ... This isn't a bug, but it's subtle. There is only a problem if you are trying to substitute a Python unicode object into a bytestring. That's because Python tries to coerce the two elements into the same type (unicode in this case) and it uses the "ascii" codec by default. If you try to substitute a bytestring into a bytestring, no problems. The example you started the thread with was the former case: you were using a u'...' string as the first argument and a bytestring (request.content) as the second argument. The CSRF middleware is using bytestrings throughout, so it's safe. > django.test.testcases.TestCase.assertContains: > > def assertContains(self, response, text, count=None, status_code=200): > ... > real_count = response.content.count(text) > ... Yes, this is a semi-bug. The "correct" way to use it is non-obvious: you need to make sure 'text' is a bytestring -- so it's possible to use it correctly, but the obvious way is sometimes wrong, which makes it a bad API. When this popped up it previously it was because "text" was a unicode object and response.content wasn't, so Python tried to coerce the former to a Python unicode object and failed dismally. This is an argument in favour of adding a unicode_content attribute to HttpResponse. If you want to be a really good maintainer here and really give encoding in responses a workover, these are the things I would think about: - if somebody specifies a mimetype with a content encoding, we should use that for the encoding (not re-encode to UTF-8). - if the mimetype isn't something that can be sensibly re-encoded, don't. For example, image/jpeg shouldn't go through the re-encoding washing machine. The problem is that this is all very difficult to get correct without dozens and dozens of special cases. I suspect the right solution is that if a mimetype is specified and a bytestring is passed in, we should *never* re-encode the information. If a unicode object is passed in, we can encode it according to the charset specified (of self._charset otherwise). Think about it a bit and see if that makes sense to you. This is fairly brain-twisting stuff, but there should be a simple solution where we don't try to second-guess the user. Basically, I'd like images and other binary opaque data not to be accidentally munged by middleware (there's a ticket open about respecting the content-transfer header, for example, that's related to this, too). Cheers, Malcolm -- The early bird may get the worm, but the second mouse gets the cheese. http://www.pointy-stick.com/blog/ --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-users?hl=en -~--~~~~--~~--~--~---
Re: using re.sub with unicode string in response middleware
Malcolm Tredinnick wrote: > On Mon, 2008-01-07 at 18:28 -0600, Gary Wilson Jr. wrote: >> Malcolm Tredinnick wrote: >>> On Sun, 2008-01-06 at 15:25 -0600, Gary Wilson Jr. wrote: It appears that at this point, response.content is a utf8-encoded bytestring. I'm playing with a response middleware doing something like: MY_RE.sub(u'%s' % text, response.content) which raises a UnicodeDecodeError if response.content contains non-ascii. I understand that the strings need to be of the same type, but was wondering if response.content needs to be returned as a utf8-encoded bytestring or if it's ok to convert it to unicode and return that. Does it matter? >>> It must be UTF-8 (or, at least, a bytestring). Some encoding to be in >>> force, since "unicode" isn't a character encoding and response.content >>> is the last station before we send stuff back to the web server. >> So to make sure I've got this right, would either of the two examples below >> be >> sufficient? >> >> content = MY_RE.sub(u'%s' % text, force_unicode(response.content)) >> content = content.encode('utf-8') > > Not quite sure what you're doing with "content" here, since a response > middleware modifies the response directly. Since you can happily set > "content" with a unicode object, you should just be able to do > > request.content = Sorry, I was using "content" just for the sake of brevity. >> content = MY_RE.sub((u'%s' % text).encode('utf-8'), response.content) > > In both cases, for absolutely bullet-proofness, you could use > response._charset as the encoding (rather than assuming it's the default > of UTF-8). Obviously depends on circumstances, but if this is for > something in Django's core, for example, it needs to be flexible. Every > now and again, somebody is going to change the DEFAULT_CHARSET value. Ah, thank you. I was wrongly assuming this would always be utf-8 here. > (There is, by the way, a subtle semi-bug hidden in there: if you pass in > a mime type, including an encoding, we still (re-)encode the data, which > is a little naughty. It's difficult to work out all the cases when we > should and shouldn't, though. Again, lots of "we could do..." > possibilities, but each one has trade-offs. That's a way-out-there > edge-case, though.) > >>> I realise this is slightly inconvenient for middleware classes, but >>> since we cannot tell ahead of time if any middleware classes are going >>> to be invoked, we have to treat response.content specially. >> Could the handler not do the final encoding as the last thing it does on the >> response's way out (so after any middleware has been processed)? > > Naturally, anything is possible, but I don't like the design. > HttpResponse returns a valid HTTP response via it's __str__ method and > valid HTTP data via the "content" attribute. That's a nicely > encapsulated design. Let's resist messing with it and keep the > responsibility in the right place. > > If you really want to avoid the whole extra dozen characters of typing > now and again, let's add an unicode_content property to HttpResponse. That was my second thought :) > Regards, > Malcolm Thanks for clearing things up Malcolm. So, looking at a couple places in Django trunk where response.content is used, these look like bugs: django.contrib.csrf.middleware.CsrfMiddleware.process_response: def process_response(self, request, response): ... response.content = _POST_FORM_RE.sub(add_csrf_field, response.content) ... django.test.testcases.TestCase.assertContains: def assertContains(self, response, text, count=None, status_code=200): ... real_count = response.content.count(text) ... Would you agree? Looks like someone has hit the second one before: http://groups.google.com/group/django-users/browse_thread/thread/b3d7a40423e011c3/311206ff6601a376 Gary --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-users?hl=en -~--~~~~--~~--~--~---
Re: using re.sub with unicode string in response middleware
On Mon, 2008-01-07 at 18:28 -0600, Gary Wilson Jr. wrote: > Malcolm Tredinnick wrote: > > On Sun, 2008-01-06 at 15:25 -0600, Gary Wilson Jr. wrote: > >> It appears that at this point, response.content is a utf8-encoded > >> bytestring. > >> I'm playing with a response middleware doing something like: > >> > >> MY_RE.sub(u'%s' % text, response.content) > >> > >> which raises a UnicodeDecodeError if response.content contains non-ascii. > >> > >> I understand that the strings need to be of the same type, but was > >> wondering > >> if response.content needs to be returned as a utf8-encoded bytestring or if > >> it's ok to convert it to unicode and return that. Does it matter? > > > > It must be UTF-8 (or, at least, a bytestring). Some encoding to be in > > force, since "unicode" isn't a character encoding and response.content > > is the last station before we send stuff back to the web server. > > So to make sure I've got this right, would either of the two examples below be > sufficient? > > content = MY_RE.sub(u'%s' % text, force_unicode(response.content)) > content = content.encode('utf-8') Not quite sure what you're doing with "content" here, since a response middleware modifies the response directly. Since you can happily set "content" with a unicode object, you should just be able to do request.content = > > content = MY_RE.sub((u'%s' % text).encode('utf-8'), response.content) In both cases, for absolutely bullet-proofness, you could use response._charset as the encoding (rather than assuming it's the default of UTF-8). Obviously depends on circumstances, but if this is for something in Django's core, for example, it needs to be flexible. Every now and again, somebody is going to change the DEFAULT_CHARSET value. (There is, by the way, a subtle semi-bug hidden in there: if you pass in a mime type, including an encoding, we still (re-)encode the data, which is a little naughty. It's difficult to work out all the cases when we should and shouldn't, though. Again, lots of "we could do..." possibilities, but each one has trade-offs. That's a way-out-there edge-case, though.) > > > I realise this is slightly inconvenient for middleware classes, but > > since we cannot tell ahead of time if any middleware classes are going > > to be invoked, we have to treat response.content specially. > > Could the handler not do the final encoding as the last thing it does on the > response's way out (so after any middleware has been processed)? Naturally, anything is possible, but I don't like the design. HttpResponse returns a valid HTTP response via it's __str__ method and valid HTTP data via the "content" attribute. That's a nicely encapsulated design. Let's resist messing with it and keep the responsibility in the right place. If you really want to avoid the whole extra dozen characters of typing now and again, let's add an unicode_content property to HttpResponse. Regards, Malcolm -- Experience is something you don't get until just after you need it. http://www.pointy-stick.com/blog/ --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-users?hl=en -~--~~~~--~~--~--~---
Re: using re.sub with unicode string in response middleware
Malcolm Tredinnick wrote: > On Sun, 2008-01-06 at 15:25 -0600, Gary Wilson Jr. wrote: >> It appears that at this point, response.content is a utf8-encoded bytestring. >> I'm playing with a response middleware doing something like: >> >> MY_RE.sub(u'%s' % text, response.content) >> >> which raises a UnicodeDecodeError if response.content contains non-ascii. >> >> I understand that the strings need to be of the same type, but was wondering >> if response.content needs to be returned as a utf8-encoded bytestring or if >> it's ok to convert it to unicode and return that. Does it matter? > > It must be UTF-8 (or, at least, a bytestring). Some encoding to be in > force, since "unicode" isn't a character encoding and response.content > is the last station before we send stuff back to the web server. So to make sure I've got this right, would either of the two examples below be sufficient? content = MY_RE.sub(u'%s' % text, force_unicode(response.content)) content = content.encode('utf-8') content = MY_RE.sub((u'%s' % text).encode('utf-8'), response.content) > I realise this is slightly inconvenient for middleware classes, but > since we cannot tell ahead of time if any middleware classes are going > to be invoked, we have to treat response.content specially. Could the handler not do the final encoding as the last thing it does on the response's way out (so after any middleware has been processed)? Gary --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-users?hl=en -~--~~~~--~~--~--~---
Re: using re.sub with unicode string in response middleware
On Sun, 2008-01-06 at 15:25 -0600, Gary Wilson Jr. wrote: > It appears that at this point, response.content is a utf8-encoded bytestring. > I'm playing with a response middleware doing something like: > > MY_RE.sub(u'%s' % text, response.content) > > which raises a UnicodeDecodeError if response.content contains non-ascii. > > I understand that the strings need to be of the same type, but was wondering > if response.content needs to be returned as a utf8-encoded bytestring or if > it's ok to convert it to unicode and return that. Does it matter? It must be UTF-8 (or, at least, a bytestring). Some encoding to be in force, since "unicode" isn't a character encoding and response.content is the last station before we send stuff back to the web server. I realise this is slightly inconvenient for middleware classes, but since we cannot tell ahead of time if any middleware classes are going to be invoked, we have to treat response.content specially. Regards, Malcolm -- Why can't you be a non-conformist like everyone else? http://www.pointy-stick.com/blog/ --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-users?hl=en -~--~~~~--~~--~--~---