Re: using re.sub with unicode string in response middleware

2008-01-08 Thread Malcolm Tredinnick


On Tue, 2008-01-08 at 22:45 -0600, Gary Wilson Jr. wrote:
> Malcolm Tredinnick wrote:
> > Hey Gary,
> > 
> > On Tue, 2008-01-08 at 00:35 -0600, Gary Wilson Jr. wrote:
> > [...]
> >> So, looking at a couple places in Django trunk where response.content is 
> >> used,
> >> these look like bugs:
> >>
> >>
> >> django.contrib.csrf.middleware.CsrfMiddleware.process_response:
> >>
> >> def process_response(self, request, response):
> >> ...
> >> response.content = _POST_FORM_RE.sub(add_csrf_field, response.content)
> >> ...
> > 
> > This isn't a bug, but it's subtle. There is only a problem if you are
> > trying to substitute a Python unicode object into a bytestring. That's
> > because Python tries to coerce the two elements into the same type
> > (unicode in this case) and it uses the "ascii" codec by default. If you
> > try to substitute a bytestring into a bytestring, no problems.
> 
> I was thinking that two differently-encoded bytestrings could cause problems,
> unless...  Are _all_ the valid HTTP character encodings supersets of ascii?

I'm not saying it makes sense to do this, just that it works without
raising a Python error because no type coercion takes place. Doing
substitutions on raw bytes is legal.

Not all encodings are supersets of ASCII, since the question doesn't
make sense in many situations. What's the character set encoding of a
PNG image? :-) It might even be that some text encodings fail to meet
that requirement, but they clearly aren't that common and it's only
going to be a requirement for legacy systems (in which case, most
middleware is unlikely to be useful: if a legacy system can't even
understand UTF-8, understanding things like E-tags and gzip encodings
will be beyond it).

We're making some not-unreasonable simplifying assumptions and making
things completely subclassable for people who want to go further. As the
need arises to incorporate stuff in core, we can do it. Because we need
to, not just because we can. At the moment, this hasn't really been a
huge issue and it looks solvable with one attribute being added.

Cheers,
Malcolm

-- 
Why can't you be a non-conformist like everyone else? 
http://www.pointy-stick.com/blog/


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~--~~~~--~~--~--~---



Re: using re.sub with unicode string in response middleware

2008-01-08 Thread Gary Wilson Jr.

Malcolm Tredinnick wrote:
> Hey Gary,
> 
> On Tue, 2008-01-08 at 00:35 -0600, Gary Wilson Jr. wrote:
> [...]
>> So, looking at a couple places in Django trunk where response.content is 
>> used,
>> these look like bugs:
>>
>>
>> django.contrib.csrf.middleware.CsrfMiddleware.process_response:
>>
>> def process_response(self, request, response):
>> ...
>> response.content = _POST_FORM_RE.sub(add_csrf_field, response.content)
>> ...
> 
> This isn't a bug, but it's subtle. There is only a problem if you are
> trying to substitute a Python unicode object into a bytestring. That's
> because Python tries to coerce the two elements into the same type
> (unicode in this case) and it uses the "ascii" codec by default. If you
> try to substitute a bytestring into a bytestring, no problems.

I was thinking that two differently-encoded bytestrings could cause problems,
unless...  Are _all_ the valid HTTP character encodings supersets of ascii?

I guess rot13 doesn't count :)

>>> re.compile('sha').sub('not fun', u'fun'.encode('rot13'))
'not fun'


> If you want to be a really good maintainer here and really give encoding
> in responses a workover, these are the things I would think about:

I'll take a look into this when I get some spare time.

Gary

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~--~~~~--~~--~--~---



Re: using re.sub with unicode string in response middleware

2008-01-08 Thread Malcolm Tredinnick

Hey Gary,

On Tue, 2008-01-08 at 00:35 -0600, Gary Wilson Jr. wrote:
[...]
> So, looking at a couple places in Django trunk where response.content is used,
> these look like bugs:
> 
> 
> django.contrib.csrf.middleware.CsrfMiddleware.process_response:
> 
> def process_response(self, request, response):
> ...
> response.content = _POST_FORM_RE.sub(add_csrf_field, response.content)
> ...

This isn't a bug, but it's subtle. There is only a problem if you are
trying to substitute a Python unicode object into a bytestring. That's
because Python tries to coerce the two elements into the same type
(unicode in this case) and it uses the "ascii" codec by default. If you
try to substitute a bytestring into a bytestring, no problems.

The example you started the thread with was the former case: you were
using a u'...' string as the first argument and a bytestring
(request.content) as the second argument. The CSRF middleware is using
bytestrings throughout, so it's safe.

> django.test.testcases.TestCase.assertContains:
> 
> def assertContains(self, response, text, count=None, status_code=200):
> ...
> real_count = response.content.count(text)
> ...

Yes, this is a semi-bug. The "correct" way to use it is non-obvious: you
need to make sure 'text' is a bytestring -- so it's possible to use it
correctly, but the obvious way is sometimes wrong, which makes it a bad
API.

When this popped up it previously it was because "text" was a unicode
object and response.content wasn't, so Python tried to coerce the former
to a Python unicode object and failed dismally. This is an argument in
favour of adding a unicode_content attribute to HttpResponse.

If you want to be a really good maintainer here and really give encoding
in responses a workover, these are the things I would think about:

- if somebody specifies a mimetype with a content encoding, we
should use that for the encoding (not re-encode to UTF-8).

- if the mimetype isn't something that can be sensibly
re-encoded, don't. For example, image/jpeg shouldn't go through
the re-encoding washing machine.

The problem is that this is all very difficult to get correct without
dozens and dozens of special cases. I suspect the right solution is that
if a mimetype is specified and a bytestring is passed in, we should
*never* re-encode the information. If a unicode object is passed in, we
can encode it according to the charset specified (of self._charset
otherwise). Think about it a bit and see if that makes sense to you.
This is fairly brain-twisting stuff, but there should be a simple
solution where we don't try to second-guess the user. Basically, I'd
like images and other binary opaque data not to be accidentally munged
by middleware (there's a ticket open about respecting the
content-transfer header, for example, that's related to this, too).

Cheers,
Malcolm

-- 
The early bird may get the worm, but the second mouse gets the cheese. 
http://www.pointy-stick.com/blog/


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~--~~~~--~~--~--~---



Re: using re.sub with unicode string in response middleware

2008-01-07 Thread Gary Wilson Jr.

Malcolm Tredinnick wrote:
> On Mon, 2008-01-07 at 18:28 -0600, Gary Wilson Jr. wrote:
>> Malcolm Tredinnick wrote:
>>> On Sun, 2008-01-06 at 15:25 -0600, Gary Wilson Jr. wrote:
 It appears that at this point, response.content is a utf8-encoded 
 bytestring.
 I'm playing with a response middleware doing something like:

 MY_RE.sub(u'%s' % text, response.content)

 which raises a UnicodeDecodeError if response.content contains non-ascii.

 I understand that the strings need to be of the same type, but was 
 wondering
 if response.content needs to be returned as a utf8-encoded bytestring or if
 it's ok to convert it to unicode and return that.  Does it matter?
>>> It must be UTF-8 (or, at least, a bytestring). Some encoding to be in
>>> force, since "unicode" isn't a character encoding and response.content
>>> is the last station before we send stuff back to the web server.
>> So to make sure I've got this right, would either of the two examples below 
>> be
>> sufficient?
>>
>> content = MY_RE.sub(u'%s' % text, force_unicode(response.content))
>> content = content.encode('utf-8')
> 
> Not quite sure what you're doing with "content" here, since a response
> middleware modifies the response directly. Since you can happily set
> "content" with a unicode object, you should just be able to do
> 
> request.content = 

Sorry, I was using "content" just for the sake of brevity.

>> content = MY_RE.sub((u'%s' % text).encode('utf-8'), response.content)
> 
> In both cases, for absolutely bullet-proofness, you could use
> response._charset as the encoding (rather than assuming it's the default
> of UTF-8). Obviously depends on circumstances, but if this is for
> something in Django's core, for example, it needs to be flexible. Every
> now and again, somebody is going to change the DEFAULT_CHARSET value.

Ah, thank you.  I was wrongly assuming this would always be utf-8 here.

> (There is, by the way, a subtle semi-bug hidden in there: if you pass in
> a mime type, including an encoding, we still (re-)encode the data, which
> is a little naughty. It's difficult to work out all the cases when we
> should and shouldn't, though. Again, lots of "we could do..."
> possibilities, but each one has trade-offs. That's a way-out-there
> edge-case, though.)
> 
>>> I realise this is slightly inconvenient for middleware classes, but
>>> since we cannot tell ahead of time if any middleware classes are going
>>> to be invoked, we have to treat response.content specially.
>> Could the handler not do the final encoding as the last thing it does on the
>> response's way out (so after any middleware has been processed)?
> 
> Naturally, anything is possible, but I don't like the design.
> HttpResponse returns a valid HTTP response via it's __str__ method and
> valid HTTP data via the "content" attribute. That's a nicely
> encapsulated design. Let's resist messing with it and keep the
> responsibility in the right place.
> 
> If you really want to avoid the whole extra dozen characters of typing
> now and again, let's add an unicode_content property to HttpResponse.

That was my second thought :)

> Regards,
> Malcolm

Thanks for clearing things up Malcolm.

So, looking at a couple places in Django trunk where response.content is used,
these look like bugs:


django.contrib.csrf.middleware.CsrfMiddleware.process_response:

def process_response(self, request, response):
...
response.content = _POST_FORM_RE.sub(add_csrf_field, response.content)
...


django.test.testcases.TestCase.assertContains:

def assertContains(self, response, text, count=None, status_code=200):
...
real_count = response.content.count(text)
...


Would you agree?

Looks like someone has hit the second one before:
http://groups.google.com/group/django-users/browse_thread/thread/b3d7a40423e011c3/311206ff6601a376

Gary

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~--~~~~--~~--~--~---



Re: using re.sub with unicode string in response middleware

2008-01-07 Thread Malcolm Tredinnick


On Mon, 2008-01-07 at 18:28 -0600, Gary Wilson Jr. wrote:
> Malcolm Tredinnick wrote:
> > On Sun, 2008-01-06 at 15:25 -0600, Gary Wilson Jr. wrote:
> >> It appears that at this point, response.content is a utf8-encoded 
> >> bytestring.
> >> I'm playing with a response middleware doing something like:
> >>
> >> MY_RE.sub(u'%s' % text, response.content)
> >>
> >> which raises a UnicodeDecodeError if response.content contains non-ascii.
> >>
> >> I understand that the strings need to be of the same type, but was 
> >> wondering
> >> if response.content needs to be returned as a utf8-encoded bytestring or if
> >> it's ok to convert it to unicode and return that.  Does it matter?
> > 
> > It must be UTF-8 (or, at least, a bytestring). Some encoding to be in
> > force, since "unicode" isn't a character encoding and response.content
> > is the last station before we send stuff back to the web server.
> 
> So to make sure I've got this right, would either of the two examples below be
> sufficient?
> 
> content = MY_RE.sub(u'%s' % text, force_unicode(response.content))
> content = content.encode('utf-8')

Not quite sure what you're doing with "content" here, since a response
middleware modifies the response directly. Since you can happily set
"content" with a unicode object, you should just be able to do

request.content = 

> 
> content = MY_RE.sub((u'%s' % text).encode('utf-8'), response.content)

In both cases, for absolutely bullet-proofness, you could use
response._charset as the encoding (rather than assuming it's the default
of UTF-8). Obviously depends on circumstances, but if this is for
something in Django's core, for example, it needs to be flexible. Every
now and again, somebody is going to change the DEFAULT_CHARSET value.

(There is, by the way, a subtle semi-bug hidden in there: if you pass in
a mime type, including an encoding, we still (re-)encode the data, which
is a little naughty. It's difficult to work out all the cases when we
should and shouldn't, though. Again, lots of "we could do..."
possibilities, but each one has trade-offs. That's a way-out-there
edge-case, though.)

> 
> > I realise this is slightly inconvenient for middleware classes, but
> > since we cannot tell ahead of time if any middleware classes are going
> > to be invoked, we have to treat response.content specially.
> 
> Could the handler not do the final encoding as the last thing it does on the
> response's way out (so after any middleware has been processed)?

Naturally, anything is possible, but I don't like the design.
HttpResponse returns a valid HTTP response via it's __str__ method and
valid HTTP data via the "content" attribute. That's a nicely
encapsulated design. Let's resist messing with it and keep the
responsibility in the right place.

If you really want to avoid the whole extra dozen characters of typing
now and again, let's add an unicode_content property to HttpResponse.

Regards,
Malcolm

-- 
Experience is something you don't get until just after you need it. 
http://www.pointy-stick.com/blog/


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~--~~~~--~~--~--~---



Re: using re.sub with unicode string in response middleware

2008-01-07 Thread Gary Wilson Jr.

Malcolm Tredinnick wrote:
> On Sun, 2008-01-06 at 15:25 -0600, Gary Wilson Jr. wrote:
>> It appears that at this point, response.content is a utf8-encoded bytestring.
>> I'm playing with a response middleware doing something like:
>>
>> MY_RE.sub(u'%s' % text, response.content)
>>
>> which raises a UnicodeDecodeError if response.content contains non-ascii.
>>
>> I understand that the strings need to be of the same type, but was wondering
>> if response.content needs to be returned as a utf8-encoded bytestring or if
>> it's ok to convert it to unicode and return that.  Does it matter?
> 
> It must be UTF-8 (or, at least, a bytestring). Some encoding to be in
> force, since "unicode" isn't a character encoding and response.content
> is the last station before we send stuff back to the web server.

So to make sure I've got this right, would either of the two examples below be
sufficient?

content = MY_RE.sub(u'%s' % text, force_unicode(response.content))
content = content.encode('utf-8')

content = MY_RE.sub((u'%s' % text).encode('utf-8'), response.content)

> I realise this is slightly inconvenient for middleware classes, but
> since we cannot tell ahead of time if any middleware classes are going
> to be invoked, we have to treat response.content specially.

Could the handler not do the final encoding as the last thing it does on the
response's way out (so after any middleware has been processed)?

Gary

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~--~~~~--~~--~--~---



Re: using re.sub with unicode string in response middleware

2008-01-06 Thread Malcolm Tredinnick


On Sun, 2008-01-06 at 15:25 -0600, Gary Wilson Jr. wrote:
> It appears that at this point, response.content is a utf8-encoded bytestring.
> I'm playing with a response middleware doing something like:
> 
> MY_RE.sub(u'%s' % text, response.content)
> 
> which raises a UnicodeDecodeError if response.content contains non-ascii.
> 
> I understand that the strings need to be of the same type, but was wondering
> if response.content needs to be returned as a utf8-encoded bytestring or if
> it's ok to convert it to unicode and return that.  Does it matter?

It must be UTF-8 (or, at least, a bytestring). Some encoding to be in
force, since "unicode" isn't a character encoding and response.content
is the last station before we send stuff back to the web server.

I realise this is slightly inconvenient for middleware classes, but
since we cannot tell ahead of time if any middleware classes are going
to be invoked, we have to treat response.content specially.

Regards,
Malcolm

-- 
Why can't you be a non-conformist like everyone else? 
http://www.pointy-stick.com/blog/


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~--~~~~--~~--~--~---