Re: UTF-8 in basic CGI mode
Thanks, Sion, that makes sense!

Would it be correct to assume that the encoding of strings retrieved by FieldStorage() would be the same as the encoding of the submitted web form (in my case utf-8)?

Funny but I have the same form implemented in PSP (Python Server Pages), running under Apache with mod_python and it works transparently with no explicit charset translation required.

On Jan 16, 4:31 pm, Sion Arrowsmith <[EMAIL PROTECTED]> wrote:
> coldpizza <[EMAIL PROTECTED]> wrote:
> >I am using this 'word' variable like this:
> >
> >print u'' % (word)
> >
> >and apparently this causes exceptions with non-ASCII strings.
> >
> >I've also tried this:
> >print u'' %
> >(word.encode('utf8'))
> >but I still get the same UnicodeDecodeError..
>
> Your 'word' is a byte string (presumably UTF8 encoded). When python
> is asked to insert a byte string into a unicode string (as you are
> doing with the % operator, but the same applies to concatenation
> with the + operator) it attempts to convert the byte string into
> unicode. And the default encoding is 'ascii', and the ascii codec
> takes a very strict view about what an ASCII character is -- and
> that is that only characters below 128 are ASCII.
>
> To get it to work, you need to *decode* word. It is already UTF8
> (or something) encoded. Under most circumstances, use encode() to
> turn unicode strings to byte strings, and decode() to go in the
> other direction.
>
> --
> \S -- [EMAIL PROTECTED] -- http://www.chaos.org.uk/~sion/
> "Frankly I have no feelings towards penguins one way or the other"
> -- Arthur C. Clarke
> her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump

--
http://mail.python.org/mailman/listinfo/python-list
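
On the question of which encoding the submitted form data arrives in: browsers normally send form values percent-encoded in the encoding of the page that contained the form, unless the form says otherwise. The sketch below only illustrates that assumption; the query-string value is made up, and the percent-decoding is done by hand rather than through cgi.FieldStorage().

```python
# Works on Python 2 and Python 3; only the import location differs.
try:
    from urllib.parse import unquote_to_bytes            # Python 3
except ImportError:
    from urllib import unquote as unquote_to_bytes       # Python 2: str in, byte str out

# Hypothetical value of the 'word' field as a UTF-8 page would submit it
# in a GET request (percent-encoded UTF-8 bytes).
submitted = '%D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82'

# Step 1: undo the percent-encoding. The result is still a byte string.
raw = unquote_to_bytes(submitted)

# Step 2: decode the bytes using the encoding the page was served in
# (assumed here to be UTF-8).
word = raw.decode('utf-8')
```

If the page were served in a different charset, the second step would need that charset's name instead of 'utf-8'.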
Re: UTF-8 in basic CGI mode
coldpizza <[EMAIL PROTECTED]> wrote:
>I am using this 'word' variable like this:
>
>print u'' % (word)
>
>and apparently this causes exceptions with non-ASCII strings.
>
>I've also tried this:
>print u'' %
>(word.encode('utf8'))
>but I still get the same UnicodeDecodeError..

Your 'word' is a byte string (presumably UTF8 encoded). When python is asked to insert a byte string into a unicode string (as you are doing with the % operator, but the same applies to concatenation with the + operator) it attempts to convert the byte string into unicode. And the default encoding is 'ascii', and the ascii codec takes a very strict view about what an ASCII character is -- and that is that only characters below 128 are ASCII.

To get it to work, you need to *decode* word. It is already UTF8 (or something) encoded. Under most circumstances, use encode() to turn unicode strings to byte strings, and decode() to go in the other direction.

--
\S -- [EMAIL PROTECTED] -- http://www.chaos.org.uk/~sion/
"Frankly I have no feelings towards penguins one way or the other"
-- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
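
A minimal sketch of the decode/encode advice above, assuming the submitted value really is UTF-8. The names raw_word, html and payload and the markup in the format string are made up for illustration; the byte string stands in for whatever the CGI layer hands back.

```python
# Hypothetical stand-in for the byte string received from the form.
# b'\xd0\xbf' is the UTF-8 encoding of the Cyrillic letter U+043F.
raw_word = b'\xd0\xbf'

# Decode once, at the input boundary: byte string -> unicode string.
word = raw_word.decode('utf-8')

# Inside the program, combine unicode with unicode only.
# (Under Python 2, u'<b>%s</b>' % raw_word would instead trigger the
# implicit ASCII decode described above and fail on byte 0xd0.)
html = u'<b>%s</b>' % word

# Encode once, at the output boundary: unicode string -> byte string.
payload = html.encode('utf-8')
```

This also suggests why calling word.encode('utf8') on the original value did not help: in Python 2, calling encode() on a byte string first decodes it implicitly with the 'ascii' codec, which is the same step that was already failing.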
UTF-8 in basic CGI mode
Hi,

I have a basic Python CGI web form that shows data from a SQLite3 database. It runs under the built-in CGIWebserver, which looks something like this:

[code]
from BaseHTTPServer import HTTPServer
from CGIHTTPServer import CGIHTTPRequestHandler

# The server address must be a (host, port) tuple with an integer port
HTTPServer(("", 8000), CGIHTTPRequestHandler).serve_forever()
[/code]

The script runs OK with ASCII characters, but when I try to process non-ASCII data I get a UnicodeDecodeError exception ('ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)).

I have added the 'u' prefix to all my literal strings, and I _have_ wrapped all my output statements into myString.encode('utf8', "replace"), but apparently the UnicodeDecodeError exception occurs because of a string that I get back to the script through cgi.FieldStorage(). I.e. I have the lines:

form = cgi.FieldStorage()
word = form['word']

which retrieve the 'word' value from a GET request. I am using this 'word' variable like this:

print u'' % (word)

and apparently this causes exceptions with non-ASCII strings. I've also tried this:

print u'' % (word.encode('utf8'))

but I still get the same UnicodeDecodeError.

What is the general good practice for working with UTF8? The standard Python CGI documentation has nothing on character sets. It looks insane to have to explicitly wrap every string with .encode('utf8'), but even this does not work.

Could the problem be related to the encoding of the string returned by cgi.FieldStorage()? My page is using UTF-8 encoding. What would be the encoding for the data that comes from the browser after the form is submitted?

Why does Python always try to use 'ascii'? I have checked all my strings and they are prefixed with 'u'. I have also tried replacing print statements with sys.stdout.write(DATA.encode('utf8')) but this did not help.

Any clues?
UTF-8 in basic CGI mode
Hi,

I have a basic Python CGI web form that shows data from a SQLite3 database. It runs under the built-in CGIWebserver, which looks like this:

[code]
import SimpleHTTPServer
import SocketServer

SocketServer.TCPServer(("", 80), SimpleHTTPServer.SimpleHTTPRequestHandler).serve_forever()
[/code]

The script runs OK with ASCII characters, but when I try to process non-ASCII data I get a UnicodeDecodeError exception ('ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)).

I have added the 'u' prefix to all my literal strings, and I _have_ wrapped all my output statements into myString.encode('utf8', "replace"), but apparently the UnicodeDecodeError exception occurs because of a string that I get back to the script through cgi.FieldStorage(). I.e. I have the lines:

form = cgi.FieldStorage()
word = form['word']

which retrieve the 'word' value from a GET request. I am using this 'word' variable like this:

print u'' % (word)

and apparently this causes exceptions with non-ASCII strings. I've also tried this:

print u'' % (word.encode('utf8'))

but I still get the same UnicodeDecodeError.

What is the general good practice for working with UTF8? The standard Python CGI documentation has nothing on character sets. It looks insane to have to explicitly wrap every string with .encode('utf8'), but even this does not work.

Could the problem be related to the encoding of the string returned by cgi.FieldStorage()? My page is using UTF-8 encoding. What would be the encoding for the data that comes from the browser after the form is submitted?

Why does Python always try to use 'ascii'? I have checked all my strings and they are prefixed with 'u'. I have also tried replacing print statements with sys.stdout.write(DATA.encode('utf8')) but this did not help.

Any clues? Thanks in advance.