Re: utf8 and ftplib
Fredrik Lundh wrote: > character references refer to code points in the Unicode code > space, so you just convert the bytes you get after converting > to UTF-8. "so you cannot just", of course. -- http://mail.python.org/mailman/listinfo/python-list
Re: utf8 and ftplib
Richard Lewis wrote: > On Mon, 20 Jun 2005 14:27:17 +0200, "Fredrik Lundh" > <[EMAIL PROTECTED]> said: > > > > well, you're messing it up all by yourself. getting rid of all the > > codecs and > > unicode2charrefs nonsense will fix this: > > > Thanks for being so patient and understanding. > > OK, I've taken it all out. The only thinking about encoding I had to do > in the actual code I'm working on was to use: > file.write(document.toxml(encoding="utf-8")) > > instead of just > file.write(document.toxml()) > > because otherwise I got errors on copyright symbol characters. sounds like a bug in minidom... > My code now works without generating any errors but Konqueror's KHTML > and Embedded Advanced Text Viewer and IE5 on the Mac still show > capital-A-with-a-tilde in all the files that have been > generated/altered. Whereas my text editor and Mozilla show them > correctly. > > The "unicode2charrefs() nonsense" was an attempt to make it output with > character references rather than literal characters for all characters > with codes greater than 128. Is there a way of doing this? character references refer to code points in the Unicode code space, so you just convert the bytes you get after converting to UTF-8. however, if you're only using characters from the ISO Latin 1 set (which is a strict subset of Unicode), you could en- code to "iso-8859-1" and run unicode2charrefs on the result. but someone should really fix minidom so it does the right thing. (fwiw, if you use my ElementTree kit, you can simply do tree.write(encoding="us-ascii") and the toolkit will then use charrefs for any character that's not plain ascii. you can get ElementTree from here: http://effbot.org/zone/element-index.htm ) -- http://mail.python.org/mailman/listinfo/python-list
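The point about character references and code points can be sketched in present-day Python (Python 3, not the 2.3 used in this thread). The helper name `to_charrefs` and the sample text are made up for illustration, and the standard-library `xml.etree.ElementTree` is assumed to behave like the ElementTree kit mentioned above. A reference carries the code point of the character, so it has to be built from the decoded text, not from the UTF-8 bytes:

```python
import xml.etree.ElementTree as ET


def to_charrefs(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # "xmlcharrefreplace" emits &#N; where N is the code point of the character.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "caf\u00e9 \u00a9"  # e-acute and the copyright sign

# Built from code points: one reference per character.
good = to_charrefs(sample)

# Built from UTF-8 bytes misread as ISO-8859-1: two references per
# character, which is the "capital A with tilde" effect.
bad = to_charrefs(sample.encode("utf-8").decode("iso-8859-1"))

# ElementTree applies the same replacement when asked for an ASCII document.
root = ET.Element("p")
root.text = sample
ascii_xml = ET.tostring(root, encoding="us-ascii")

print(good)
print(bad)
print(ascii_xml)
```

Comparing `good` and `bad` shows why running a charref conversion over already encoded data cannot work.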
Re: utf8 and ftplib
Richard Lewis wrote:
> My code now works without generating any errors but Konqueror's KHTML
> and Embedded Advanced Text Viewer and IE5 on the Mac still show
> capital-A-with-a-tilde in all the files that have been
> generated/altered. Whereas my text editor and Mozilla show them
> correctly.

How are you viewing the files? You have to tell the browser that they are UTF-8. If you just double-click the file, the browser will use its default encoding. If you are serving the files from a web server, then you should set the Content-Type header correctly. Or you can tell the browser directly (try View / Encoding in IE).

Kent
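One way to tell the browser, sketched in Python 3 under stated assumptions: the function name `build_page` is hypothetical, and the `meta` declaration inside the document is an addition that the thread does not mention. Whether the older browsers named above honour it has not been checked here.

```python
def build_page(body_text):
    """Return an HTML page as UTF-8 bytes with the encoding declared inside it."""
    html = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "</head><body><p>" + body_text + "</p></body></html>"
    )
    # Encode exactly once, at the point where text becomes bytes.
    return html.encode("utf-8")


page = build_page("\u00a9 2005")

# The header a web server would send alongside these bytes.
content_type_header = "Content-Type: text/html; charset=utf-8"
```

The declaration and the actual encoding have to agree; declaring UTF-8 while writing ISO-8859-1 bytes (or the reverse) brings the gibberish back.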
Re: utf8 and ftplib
On Mon, 20 Jun 2005 14:27:17 +0200, "Fredrik Lundh" <[EMAIL PROTECTED]> said: > > well, you're messing it up all by yourself. getting rid of all the > codecs and > unicode2charrefs nonsense will fix this: > Thanks for being so patient and understanding. OK, I've taken it all out. The only thinking about encoding I had to do in the actual code I'm working on was to use: file.write(document.toxml(encoding="utf-8")) instead of just file.write(document.toxml()) because otherwise I got errors on copyright symbol characters. (And similarly, I had to use file.write(unicode_string.encode("utf-8")) in another part of the actual code in order to prevent the same problem.) My code now works without generating any errors but Konqueror's KHTML and Embedded Advanced Text Viewer and IE5 on the Mac still show capital-A-with-a-tilde in all the files that have been generated/altered. Whereas my text editor and Mozilla show them correctly. The "unicode2charrefs() nonsense" was an attempt to make it output with character references rather than literal characters for all characters with codes greater than 128. Is there a way of doing this? (I know people will argue that character references are only preferred by humans and text editors, but if I could generate my output HTML documents with character references rather than literal characters then I wouldn't have the problem of incorrectly displayed characters on Konqueror and IE 5 for Mac. Which would be nice.) Cheers, Richard -- http://mail.python.org/mailman/listinfo/python-list
Re: utf8 and ftplib
Richard Lewis wrote:
> OK, I'm still not getting this unicode business.

obviously.

> aàáâã
> eèéêë
> iìíîï
> oòóôõ
> oùúûü
>
> (If testing, make sure you save this as utf-8 encoded.)

why? that XML snippet doesn't include any UTF-8-encoded characters.

:::

> file = codecs.open(sys.argv[1], "r", "utf-8")
> document = parse(file)
> file.close()

why do you insist on decoding the stream you pass to the XML parser, when you've already been told that you shouldn't do that? change this to:

    document = parse(sys.argv[1])

> print document.toxml(encoding="utf-8")

this converts the document to UTF-8, and prints it to stdout. if you get gibberish, your stdout wants some other encoding. if you get "capital-A-with-tilde" gibberish, your stdout expects ISO-8859-1. try changing this to:

    print document.toxml(encoding=sys.stdout.encoding)

> out_str = unicode2charrefs(document.toxml(encoding="utf-8"))

this converts the document to UTF-8, and then translates the *encoded* data to character references as if the document had been encoded as ISO-8859-1. this makes no sense at all, and results in an XML document full of "capital-A-with-tilde" gibberish.

> i.e., does anyone else get two byte sequences beginning with
> capital-A-with-tilde instead of the expected characters?

since you've requested UTF-8 output, "capital A with tilde" is the expected result if you're directing output to an ISO-8859-1 stream.

> the output file is still wrong.

well, you're messing it up all by yourself. getting rid of all the codecs and unicode2charrefs nonsense will fix this:

    document = parse(sys.argv[1]) # parser decodes
    ... manipulate document ...
    file = open(..., "w")
    file.write(document.toxml(encoding="utf-8")) # writer encodes
    file.close()
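The "parser decodes, writer encodes" recipe can be tried without any files. This is a minimal Python 3 sketch (the thread itself uses Python 2.3); the sample bytes and variable names are invented for illustration, and `parseString` stands in for `parse(filename)`:

```python
from xml.dom.minidom import parseString

# Bytes as they would sit on disk: UTF-8 encoded XML ("\xc3\xa0" is a-grave).
raw = b'<?xml version="1.0" encoding="utf-8"?><doc>a\xc3\xa0</doc>'

document = parseString(raw)                      # parser decodes
text = document.documentElement.firstChild.data  # decoded text in the tree

out = document.toxml(encoding="utf-8")           # writer encodes, returns bytes

# Reading those bytes with the wrong codec yields the gibberish from the thread.
misread = out.decode("iso-8859-1")

print(repr(text))
print(repr(out))
print(repr(misread))
```

Nothing in between needs a codec: the tree holds decoded text, and bytes only exist at the two edges.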
Re: utf8 and ftplib
On Mon, 20 Jun 2005 12:37:42 +0100, "Richard Lewis" <[EMAIL PROTECTED]> said:
> [SNIP]

Just to add to this: my input document was written using character references rather than literal characters (as was the sample output document). However, I've just noticed that my mail client (or maybe something else?) has converted the character references to literal characters.
Re: utf8 and ftplib
OK, I'm still not getting this unicode business.

Given this document:

==
aàáâã
eèéêë
iìíîï
oòóôõ
oùúûü
==

(If testing, make sure you save this as utf-8 encoded.)

and this Python script:

==
import sys
from xml.dom.minidom import *
from xml.dom import *
import codecs
import string

CHARACTERS = range(128,255)

def unicode2charrefs(s):
    "Returns a unicode string with all the non-ascii characters from the given unicode string converted to character references."
    result = u""
    for c in s:
        code = ord(c)
        if code in CHARACTERS:
            result += u"&#" + string.zfill(str(code), 3).decode('utf-8') + u";"
        else:
            result += c.encode('utf-8')
    return result

def main():
    print "Parsing file..."
    file = codecs.open(sys.argv[1], "r", "utf-8")
    document = parse(file)
    file.close()
    print "done."
    print document.toxml(encoding="utf-8")
    out_str = unicode2charrefs(document.toxml(encoding="utf-8"))
    print "Writing to '" + sys.argv[1] + "~' ..."
    file = codecs.open(sys.argv[1] + "~", "w", "utf-8")
    file.write(out_str)
    file.close()
    print "done."

if __name__ == "__main__":
    main()
==

Does anyone else get this output from the "print document.toxml(encoding="utf-8")" line:

aà áâã
eèéêë
iìÃîï
oòóôõ
oùúûü

and, similarly, this output document:

==
aà áâã
eèéêë
iìÃîï
oòóôõ
oùúûü
==

i.e., does anyone else get two byte sequences beginning with capital-A-with-tilde instead of the expected characters?

I'm using the Kate editor from KDE and Konsole (using bash) shell on Linux (2.6 kernel). Does that make any difference? I've just tried it on the unicode-aware xterm and the "print document.toxml(encoding="utf-8")" line produces the expected output but the output file is still wrong.

Any ideas what's wrong?

Cheers,
Richard
Re: utf8 and ftplib
Richard Lewis wrote:
> OK, I've fiddled around a bit more but I still haven't managed to get it
> to work. I get the fact that its not the FTP operation thats causing the
> problem so it must be either the xml.minidom.parse() function (and
> whatever sort of file I give that) or the way that I write my results to
> output files after I've done my DOM processing. I'll post some more
> detailed code:
>
> def open_file(file_name):
>     ftp = ftplib.FTP(self.host)
>     ftp.login(self.login, self.passwd)
>
>     content_file = file(file_name, 'w+b')
>     ftp.retrbinary("RETR " + self.path, content_file.write)
>     ftp.quit()
>     content_file.close()
>
>     ## Case 1:
>     #self.document = parse(file_name)
>
>     ## Case 2:
>     #self.document = parse(codecs.open(file_name, 'r+b', "utf-8"))
>
>     # Case 3:
>     content_file = codecs.open(file_name, 'r', "utf-8")
>     self.document = parse(codecs.EncodedFile(content_file, "utf-8",
>     "utf-8"))
>     content_file.close()
>
> In Case1 I get the incorrectly encoded characters.

case 1 is the only one where you use the XML parser as it is designed to be used (on the stream level, XML is defined in terms of encoded text, not Unicode characters. the parser will decode things for you).

given that the XML tree returned by the parser contains *decoded* Unicode characters (in Unicode string objects), what makes you so sure that you're getting "incorrectly encoded characters" from the parser?

(I wonder why this is so hard for so many people? hardly any programmer has any problem telling the difference between, say, a 32-bit binary floating point value on disk, a floating point object, and the string representation of a float. but replace the float with a Unicode character, and anglocentric programmers immediately resort to poking-with-a-stick-in-the-dark programming. I'll figure it out, some day...)
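The float comparison can be laid out side by side. This is only an illustrative Python 3 sketch with made-up variable names; the three forms of the float are matched against three forms of one character:

```python
import struct

value = 1.5                          # the object the program works with
on_disk = struct.pack("<f", value)   # 32-bit binary form: 4 bytes
shown = repr(value)                  # string representation

char = "\u00e6"                      # the character (small ae) as a text object
encoded = char.encode("utf-8")       # its stored form: 2 bytes
reference = "&#%d;" % ord(char)      # one textual representation of it

print(value, on_disk, shown)
print(char, encoded, reference)
```

In both rows only the first column is what code should manipulate; the other two exist for storage and display.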
Re: utf8 and ftplib
"Richard Lewis" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > > On Thu, 16 Jun 2005 12:06:50 -0600, "John Roth" > <[EMAIL PROTECTED]> said: >> "Richard Lewis" <[EMAIL PROTECTED]> wrote in message >> news:[EMAIL PROTECTED] >> > Hi there, >> > >> > I'm having a problem with unicode files and ftplib (using Python >> > 2.3.5). >> > >> > I've got this code: >> > >> > xml_source = codecs.open("foo.xml", 'w+b', "utf8") >> > #xml_source = file("foo.xml", 'w+b') >> > >> > ftp.retrbinary("RETR foo.xml", xml_source.write) >> > #ftp.retrlines("RETR foo.xml", xml_source.write) >> > >> >> It looks like there are at least two problems here. The major one >> is that you seem to have a misconception about utf-8 encoding. >> > Who doesn't? ;-) Lots of people. It's not difficult to understand, it just takes a bit of attention to the messy details. The basic concept is that Unicode is _always_ processed using a unicode string _in the program_. On disk or across the internet, it's _always_ stored in an encoded form, frequently but not always utf-8. A regular string _never_ stores raw unicode; it's always some encoding. When you read text data from the internet, it's _always_ in some encoding. If that encoding is one of the utf- encodings, it needs to be converted to unicode to be processed, but it does not need to be changed at all to write it to disk. >> Whatever program you are using to read it has to then decode >> it from utf-8 into unicode. Failure to do this is what is causing >> the extra characters on output. >> > >> >> Amusingly, this would have worked: >> >> xml_source = codecs.EncodedFile("foo.xml", "utf-8", "utf-8") >> >> It is, of course, an expensive way of doing nothing, but >> it at least has the virtue of being good documentation. >> > OK, I've fiddled around a bit more but I still haven't managed to get it > to work. 
I get the fact that its not the FTP operation thats causing the > problem so it must be either the xml.minidom.parse() function (and > whatever sort of file I give that) or the way that I write my results to > output files after I've done my DOM processing. I'll post some more > detailed code: Please post _all_ of the relevant code. It wastes people's time when you post incomplete examples. The critical issue is frequently in the part that you didn't post. > > def open_file(file_name): >ftp = ftplib.FTP(self.host) >ftp.login(self.login, self.passwd) > >content_file = file(file_name, 'w+b') >ftp.retrbinary("RETR " + self.path, content_file.write) >ftp.quit() >content_file.close() > >## Case 1: >#self.document = parse(file_name) > >## Case 2: >#self.document = parse(codecs.open(file_name, 'r+b', "utf-8")) > ># Case 3: >content_file = codecs.open(file_name, 'r', "utf-8") >self.document = parse(codecs.EncodedFile(content_file, "utf-8", >"utf-8")) >content_file.close() > > In Case1 I get the incorrectly encoded characters. > > In Case 2 I get the exception: > "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in > position 5208: ordinal not in range(128)" > when it calls the xml.minidom.parse() function. > > In Case 3 I get the exception: > "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in > position 5208: ordinal not in range(128)" > when it calls the xml.minidom.parse() function. That's exactly what you should expect. In the first case, the file on disk is encoded as utf-8, and this is aparently what mini-dom is expecting. The documentation shows a simple read, it does not show any kind of encoding or decoding. > Anyway, later on in the program I create a *very* large unicode string > after doing some playing with the DOM tree. I then write this to a file > using: > html_file = codecs.open(file_name, "w+b", "utf8") > html_file.write(very_large_unicode_string) > > The problem could be here? That should work. 
The problem, as I said in the first post, is that whatever program you are using to render the file to screen or print is _not_ treating the file as utf-8 encoded. It either needs to be told that the file is in utf-8 encoding, or you need to get a better rendering program.

Many renderers, including most renderers inside of programming tools like file inspectors and debuggers, assume that the encoding is latin-1 or windows-1252. This will throw up funny characters if you try to read a utf-8 (or any multi-byte encoded) file using them.

One trick that sometimes works is to ensure that the first character is the BOM (byte order mark, or unicode signature). Properly written Windows programs will use this as an encoding signature. Unixoid programs frequently won't, but that's arguably a violation of the Unicode standard. This is a single unicode character which is three bytes in utf-8 encoding.

John Roth

> Cheers,
> Richard
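The BOM trick can be checked with the standard `codecs` module. This is a small Python 3 sketch (the `utf-8-sig` codec used here is not part of the Python 2.3 discussed in the thread, and the sample string is invented):

```python
import codecs

bom_char = "\ufeff"                      # the single Unicode character
bom_bytes = bom_char.encode("utf-8")     # its UTF-8 form

# "utf-8-sig" writes the signature when encoding and strips it when decoding.
with_signature = "caf\u00e9".encode("utf-8-sig")
without_bom = with_signature.decode("utf-8-sig")

# A plain "utf-8" decode keeps the BOM as an ordinary leading character.
kept = with_signature.decode("utf-8")

print(bom_bytes, with_signature, repr(without_bom), repr(kept))
```

A consumer that does not expect the signature will see that leading character, which is one reason the trick only sometimes works.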
Re: utf8 and ftplib
On Thu, 16 Jun 2005 12:06:50 -0600, "John Roth" <[EMAIL PROTECTED]> said: > "Richard Lewis" <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] > > Hi there, > > > > I'm having a problem with unicode files and ftplib (using Python 2.3.5). > > > > I've got this code: > > > > xml_source = codecs.open("foo.xml", 'w+b', "utf8") > > #xml_source = file("foo.xml", 'w+b') > > > > ftp.retrbinary("RETR foo.xml", xml_source.write) > > #ftp.retrlines("RETR foo.xml", xml_source.write) > > > > It looks like there are at least two problems here. The major one > is that you seem to have a misconception about utf-8 encoding. > Who doesn't? ;-) > > Whatever program you are using to read it has to then decode > it from utf-8 into unicode. Failure to do this is what is causing > the extra characters on output. > > > Amusingly, this would have worked: > > xml_source = codecs.EncodedFile("foo.xml", "utf-8", "utf-8") > > It is, of course, an expensive way of doing nothing, but > it at least has the virtue of being good documentation. > OK, I've fiddled around a bit more but I still haven't managed to get it to work. I get the fact that its not the FTP operation thats causing the problem so it must be either the xml.minidom.parse() function (and whatever sort of file I give that) or the way that I write my results to output files after I've done my DOM processing. I'll post some more detailed code: def open_file(file_name): ftp = ftplib.FTP(self.host) ftp.login(self.login, self.passwd) content_file = file(file_name, 'w+b') ftp.retrbinary("RETR " + self.path, content_file.write) ftp.quit() content_file.close() ## Case 1: #self.document = parse(file_name) ## Case 2: #self.document = parse(codecs.open(file_name, 'r+b', "utf-8")) # Case 3: content_file = codecs.open(file_name, 'r', "utf-8") self.document = parse(codecs.EncodedFile(content_file, "utf-8", "utf-8")) content_file.close() In Case1 I get the incorrectly encoded characters. 
In Case 2 I get the exception: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 5208: ordinal not in range(128)" when it calls the xml.minidom.parse() function. In Case 3 I get the exception: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 5208: ordinal not in range(128)" when it calls the xml.minidom.parse() function. The character at position 5208 is an 'a' (assuming Emacs' goto-char function has the same idea about file positions as xml.minidom.parse()?). When I first tried these two new cases it came up with an unencodable character at another position. By replacing the large dash at this position with an ordinary minus sign I stopped it from raising the exception at that point in the file. I checked the character xe6 and (assuming I know what I'm doing) its a small ae ligature. Anyway, later on in the program I create a *very* large unicode string after doing some playing with the DOM tree. I then write this to a file using: html_file = codecs.open(file_name, "w+b", "utf8") html_file.write(very_large_unicode_string) The problem could be here? Cheers, Richard -- http://mail.python.org/mailman/listinfo/python-list
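Two details of that exception can be checked directly. The following Python 3 sketch (names are illustrative) looks up what U+00E6 is and reproduces the same kind of ASCII encode failure. Note that the position reported by such an error is an index into the text being encoded, so it need not match a byte offset in the file as seen by an editor:

```python
import unicodedata

char = "\xe6"
name = unicodedata.name(char)        # official Unicode name of the character

sample = "ab" + char                 # the offending character at index 2
try:
    sample.encode("ascii")
except UnicodeEncodeError as exc:
    position = exc.start             # index into the string, not the bytes
    message = str(exc)

utf8_length = len(sample.encode("utf-8"))   # bytes the same text takes on disk

print(name, position, utf8_length)
print(message)
```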
Re: utf8 and ftplib
Richard Lewis wrote: > Hi there, > > I'm having a problem with unicode files and ftplib (using Python 2.3.5). > > I've got this code: > > xml_source = codecs.open("foo.xml", 'w+b', "utf8") > #xml_source = file("foo.xml", 'w+b') > > ftp.retrbinary("RETR foo.xml", xml_source.write) > #ftp.retrlines("RETR foo.xml", xml_source.write) > > It opens a new local file using utf8 encoding and then reads from a file > on an FTP server (also utf8 encoded) into that local file. It comes up > with an error, however, on calling the xml_source.write callback (I > think) saying that: > > "File "myscript.py", line 75, in get_content > ftp.retrbinary("RETR foo.xml", xml_source.write) > File "/usr/lib/python2.3/ftplib.py", line 384, in retrbinary > callback(data) > File "/usr/lib/python2.3/codecs.py", line 400, in write > return self.writer.write(data) > File "/usr/lib/python2.3/codecs.py", line 178, in write > data, consumed = self.encode(object, self.errors) > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 76: > ordinal not in range(128)" > > I've tried using both the commented lines of code in the above example > (i.e. using file() instead of codecs.open() and retlines() instead of > retbinary()). retlines() makes no difference, but if I use file() > instead of codecs.open() I can open the file, but the extended > characters from the source file (e.g. foreign characters, copyright > symbol, etc.) all appear with an extra character in front of them > (because of the two char width in utf8?). Saying "appear with an extra character in front of them" is close to useless for diagnostic purposes -- print repr(sample_string) would be more informative. In any case, the file with the "foreign" [attitude?] characters may well be what you want. > > Is the xml_source.write callback causing the problem here? Or is it > something else? Is there any way that I can correctly retrieve a utf8 > encoded file from an FTP server? 
To get an exact copy of a file via FTP -- doesn't matter whether it's encoded in utf8 or ESCII or whatever -- use the following combination:

    xml_source = file("foo.xml", 'w+b')
    ftp.retrbinary("RETR foo.xml", xml_source.write)

If you were using a command-line FTP client, you would use the "binary" command before doing a "get" or "mget".

HTH,
John
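The same combination as a Python 3 sketch that runs offline. `fetch_exact` and `FakeFTP` are hypothetical names: the fake class only imitates the `retrbinary(cmd, callback, blocksize)` method of `ftplib.FTP`, so a real session would pass an `ftplib.FTP` instance and a file opened in `"wb"` mode instead.

```python
import io


def fetch_exact(ftp, remote_name, sink):
    """Copy a remote file into `sink` byte for byte."""
    # retrbinary hands each received block to the callback as raw bytes,
    # so the sink must accept bytes and must not re-encode them.
    ftp.retrbinary("RETR " + remote_name, sink.write)


class FakeFTP:
    """Hypothetical stand-in for ftplib.FTP so the sketch needs no server."""

    def __init__(self, payload):
        self.payload = payload

    def retrbinary(self, cmd, callback, blocksize=8192):
        for start in range(0, len(self.payload), blocksize):
            callback(self.payload[start:start + blocksize])


payload = "\u00a9 caf\u00e9".encode("utf-8")   # UTF-8 bytes "on the server"
sink = io.BytesIO()                            # stands in for open(path, "wb")
fetch_exact(FakeFTP(payload), "foo.xml", sink)

print(sink.getvalue())
```

The download never decodes anything, so the local copy is identical to the remote bytes whatever their encoding.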
Re: utf8 and ftplib
"Richard Lewis" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Hi there, > > I'm having a problem with unicode files and ftplib (using Python 2.3.5). > > I've got this code: > > xml_source = codecs.open("foo.xml", 'w+b', "utf8") > #xml_source = file("foo.xml", 'w+b') > > ftp.retrbinary("RETR foo.xml", xml_source.write) > #ftp.retrlines("RETR foo.xml", xml_source.write) > > It opens a new local file using utf8 encoding and then reads from a file > on an FTP server (also utf8 encoded) into that local file. It comes up > with an error, however, on calling the xml_source.write callback (I > think) saying that: > > "File "myscript.py", line 75, in get_content > ftp.retrbinary("RETR foo.xml", xml_source.write) > File "/usr/lib/python2.3/ftplib.py", line 384, in retrbinary > callback(data) > File "/usr/lib/python2.3/codecs.py", line 400, in write > return self.writer.write(data) > File "/usr/lib/python2.3/codecs.py", line 178, in write > data, consumed = self.encode(object, self.errors) > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 76: > ordinal not in range(128)" > > I've tried using both the commented lines of code in the above example > (i.e. using file() instead of codecs.open() and retlines() instead of > retbinary()). retlines() makes no difference, but if I use file() > instead of codecs.open() I can open the file, but the extended > characters from the source file (e.g. foreign characters, copyright > symbol, etc.) all appear with an extra character in front of them > (because of the two char width in utf8?). > > Is the xml_source.write callback causing the problem here? Or is it > something else? Is there any way that I can correctly retrieve a utf8 > encoded file from an FTP server? It looks like there are at least two problems here. The major one is that you seem to have a misconception about utf-8 encoding. 
The _disk_ version of the file is what is encoded in utf-8, and it has to be decoded to unicode on being read later. In other words, what you got is what you should have put on disk without any conversion. As you noted, when you did that, the FTP part of the process worked.

Whatever program you are using to read it has to then decode it from utf-8 into unicode. Failure to do this is what is causing the extra characters on output.

The object returned by codecs.open raised an exception because it expected a unicode string on input; it got a character string already encoded in utf-8 format. The internal mechanism is first going to try to decode that into unicode before then encoding it into utf-8. Unfortunately, the default for encoding or decoding (outside of special contexts) is ASCII-7. So everything outside of the ASCII range is invalid.

Amusingly, this would have worked:

    xml_source = codecs.EncodedFile(file("foo.xml", "w+b"), "utf-8", "utf-8")

It is, of course, an expensive way of doing nothing, but it at least has the virtue of being good documentation.

HTH
John Roth

> Cheers,
> Richard
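The failing step can be isolated. In Python 2 the ASCII decode happened implicitly inside the codecs writer; in the Python 3 sketch below (variable names are illustrative) it is written out explicitly so that the same error message appears:

```python
data = "\u00a9".encode("utf-8")   # b"\xc2\xa9", as received from the server

try:
    data.decode("ascii")          # what the implicit conversion attempted
except UnicodeDecodeError as exc:
    failed_byte = data[exc.start] # the first byte outside the ASCII range
    message = str(exc)

text = data.decode("utf-8")       # the explicit, correct decode

print(hex(failed_byte))
print(message)
print(text)
```

The byte 0xc2 is the first byte of the UTF-8 form of the copyright sign, which matches the byte named in the traceback.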
utf8 and ftplib
Hi there,

I'm having a problem with unicode files and ftplib (using Python 2.3.5).

I've got this code:

    xml_source = codecs.open("foo.xml", 'w+b', "utf8")
    #xml_source = file("foo.xml", 'w+b')

    ftp.retrbinary("RETR foo.xml", xml_source.write)
    #ftp.retrlines("RETR foo.xml", xml_source.write)

It opens a new local file using utf8 encoding and then reads from a file on an FTP server (also utf8 encoded) into that local file. It comes up with an error, however, on calling the xml_source.write callback (I think) saying that:

"File "myscript.py", line 75, in get_content
    ftp.retrbinary("RETR foo.xml", xml_source.write)
  File "/usr/lib/python2.3/ftplib.py", line 384, in retrbinary
    callback(data)
  File "/usr/lib/python2.3/codecs.py", line 400, in write
    return self.writer.write(data)
  File "/usr/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 76: ordinal not in range(128)"

I've tried using both the commented lines of code in the above example (i.e. using file() instead of codecs.open() and retrlines() instead of retrbinary()). retrlines() makes no difference, but if I use file() instead of codecs.open() I can open the file, but the extended characters from the source file (e.g. foreign characters, copyright symbol, etc.) all appear with an extra character in front of them (because of the two char width in utf8?).

Is the xml_source.write callback causing the problem here? Or is it something else? Is there any way that I can correctly retrieve a utf8 encoded file from an FTP server?

Cheers,
Richard