This is a question to Google developers who are working on the Python
runtime. Before I get to the subject matter, let me start with a
two-paragraph introduction to provide a common background for all
readers.

There is an interesting kind of XSS vulnerability involving malformed
UTF-8. An attacker can sometimes trick a website into serving
malicious JavaScript by posting specially crafted text containing
invalid UTF-8 byte sequences. For more information, see the Doctype
article "Malformed UTF-8: Who said 'hello%EE' can't be dangerous"
<http://code.google.com/p/doctype/wiki/ArticleMalformedUtf8>.

To protect against this, all user input should be validated to be
correct UTF-8 before it is sent back to other users. As long as
untrusted text contains only valid byte sequences representing real
Unicode characters, it is easy to sanitize it by replacing any <, >,
quotation marks and other special characters with safe equivalents.
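
For illustration, here is a minimal sketch of that escaping step. It
assumes a modern Python 3 interpreter (not the 2.5 runtime this post
asks about) and uses the standard html.escape function; the helper
name escape_for_html is hypothetical.

```python
import html


def escape_for_html(text):
    """Replace HTML special characters in already-decoded text.

    html.escape always replaces &, < and >; with quote=True it also
    replaces the double and single quotation marks.
    """
    return html.escape(text, quote=True)


if __name__ == "__main__":
    print(escape_for_html('<script>alert("x")</script>'))
```

This step only makes sense on text that has already been decoded
successfully, which is exactly what the rest of this post is about.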

So here is my question. What is the easiest secure way to validate
UTF-8 on App Engine? For example, my first guess would be to use the
str.decode() method (which apparently uses the codecs module):

  # the unsafe_user_input variable is a plain old str, not a
  # unicode string yet
  try:
    safe_unicode = unsafe_user_input.decode('utf8')
  except UnicodeDecodeError:
    pass  # the input is not valid UTF-8: reject it, echo nothing back
  else:
    # only reached when decoding succeeded
    response.write(escape(safe_unicode)) # this is safe now. Or is it?

But is that secure? Is it guaranteed that the UnicodeDecodeError
exception will be raised for any invalid or inappropriate UTF-8
byte sequence in the input string?
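
To make the question concrete, here is a small sketch of the kinds of
malformed input I have in mind. It is written for a modern Python 3
interpreter, where bytes plays the role of the old str type, so it
says nothing about how 2.5.4 or the App Engine runtime behaves. The
helper name is_valid_utf8 and the sample list are my own.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode under the strict UTF-8 codec."""
    try:
        data.decode('utf-8')  # errors='strict' is the default
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical samples: one valid input and several malformed ones.
SAMPLES = [
    (b'hello', 'plain ASCII'),
    (b'hello\xee', 'truncated multi-byte sequence'),
    (b'\xc0\xbc', 'overlong encoding of "<"'),
    (b'\x80', 'lone continuation byte'),
    (b'\xed\xa0\x80', 'encoded surrogate U+D800'),
]

if __name__ == "__main__":
    for raw, label in SAMPLES:
        print(label, is_valid_utf8(raw))
```

Running the same samples against the runtime in question would be one
way to see where, if anywhere, its decoder is more lenient.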

The official Python documentation does not explicitly say that. It
might be the case that in some obscure situation the resulting unicode
object would contain something strange that could yield invalid UTF-8
when printed back to the user. That could make the above code
vulnerable.
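
One defensive check against that scenario would be to re-encode the
decoded text and compare it with the original bytes. The following is
only a sketch under the same assumption of a modern Python 3
interpreter; decode_checked is a hypothetical helper, not something
the runtime provides.

```python
def decode_checked(data):
    """Decode UTF-8 bytes, returning None unless the text round-trips.

    The text is accepted only if encoding it again reproduces the
    input bytes exactly, so whatever is echoed back to the user is
    byte-for-byte what the strict decoder accepted.
    """
    try:
        text = data.decode('utf-8')       # strict error handling by default
        if text.encode('utf-8') != data:  # would the echo differ?
            return None
    except UnicodeError:                  # covers decode and encode errors
        return None
    return text


if __name__ == "__main__":
    print(decode_checked(b'hello'))
    print(decode_checked(b'hello\xee'))
```

Whether such a check is redundant is really the same question as the
one above: it depends on what the decoder is guaranteed to reject.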

I know that you have done a pretty thorough security audit of the
Python interpreter. You might have even applied some patches that can
affect UTF-8 decoding and encoding. Does the behavior of the
str.decode() method on App Engine differ in any way from what the
official Python interpreter v2.5.4 does?

Thank you,
  -- Alexander

