This is a question for the Google developers who work on the Python runtime. Before I get to the subject matter, let me start with a two-paragraph introduction to provide a common background for all readers.
There is an interesting kind of XSS vulnerability involving malformed UTF-8. An attacker can sometimes trick a website into serving malicious JavaScript by posting specially crafted text that contains invalid UTF-8 byte sequences. For more information, see the Doctype article "Malformed UTF-8: Who said 'hello%EE' can't be dangerous" <http://code.google.com/p/doctype/wiki/ArticleMalformedUtf8>.

To protect against this, all user input should be validated as correct UTF-8 before it is sent back to other users. As long as untrusted text contains only valid byte sequences representing real Unicode characters, it is easy to sanitize it by replacing <, >, quotation marks and other special characters with safe equivalents.

So here is my question: what is the easiest secure way to validate UTF-8 on App Engine?

For example, my first guess would be to use the str.decode() method (which apparently uses the codecs module):

    # the unsafe_user_input variable is a plain old str, not a
    # unicode string yet
    try:
        safe_unicode = unsafe_user_input.decode('utf8')
    except UnicodeDecodeError:
        # the input is not valid UTF-8; reject it here
        pass
    else:
        response.write(escape(safe_unicode))  # this is safe now. Or is it?

But is that secure? Is it guaranteed that UnicodeDecodeError will be raised for every invalid or inappropriate UTF-8 sequence in the input string? The official Python documentation does not explicitly say so. It might be that in some obscure situation the resulting unicode object contains something strange that yields invalid UTF-8 when it is encoded and sent back to the user. That would make the above code vulnerable.

I know that you have done a pretty thorough security audit of the Python interpreter. You might even have applied some patches that affect UTF-8 decoding and encoding. Does the behavior of str.decode() on App Engine differ in any way from that of the official Python interpreter v2.5.4?
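To make the question concrete, here is a minimal sketch of the stricter check I have in mind. It is written for a current Python 3 interpreter, where the raw input is a bytes object and html.escape plays the role of escape(); on the 2.5 runtime the input would be a plain str instead. The names decode_strict_utf8 and render_untrusted are hypothetical, and the extra rejection of surrogates and noncharacters after decoding is my own assumption about what "real Unicode characters" should mean, not something the runtime documents.

```python
from html import escape


def decode_strict_utf8(raw):
    """Return the decoded text, or None if raw is not acceptable UTF-8.

    Hypothetical helper: assumes raw is a bytes object and that each
    item of the decoded string is one full code point (true on Python 3).
    """
    try:
        # 'strict' is the default error handler; spelled out for clarity.
        text = raw.decode('utf-8', 'strict')
    except UnicodeDecodeError:
        return None

    # Second line of defence (an assumption, not a documented requirement):
    # refuse code points that should never appear in interchanged text,
    # in case the decoder lets them through.
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            return None  # surrogate code point
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return None  # Unicode noncharacter
    return text


def render_untrusted(raw):
    """Return HTML-escaped text for raw, or None if raw must be rejected."""
    text = decode_strict_utf8(raw)
    if text is None:
        return None
    # quote=True also escapes double and single quotation marks.
    return escape(text, quote=True)
```

With this, render_untrusted(b'hello\xee') is rejected, while ordinary input such as b'<b>' comes back escaped. Whether the second pass over the decoded text is actually necessary on App Engine is exactly what I would like to find out.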
Thank you,
-- Alexander