Re: [Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

Steven D'Aprano Sun, 20 Nov 2011 13:47:46 -0800

dave selby wrote:

I split the HTML and print text and I get loads of \x00T\x00r\x00i\x00a\x00, i.e. I get \x00 breaking up every character.

Any idea what is happening and how to get back to a list of ascii strings ?

How did you generate the HTML file? What other applications have you used to save the document?

Something in the tool chain before it reached Python has saved it using a wide encoding, most likely UTF-16 (two bytes for each of these characters, which is why every letter is paired with a \x00), as that is widely used by Windows and Java. With the right settings, it could take as little as opening the file in Notepad, then clicking Save.
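To see where the \x00 bytes come from, here is a small sketch, assuming Python 3. The sample string "Tria" is only made up to match the output in the question; it is not taken from the real file.

```python
# Illustration only: encode a short ASCII string as UTF-16 and look at the bytes.
sample = "Tria"

# Little-endian: each ASCII character is followed by a zero byte.
little = sample.encode("utf-16-le")

# Big-endian: each ASCII character is preceded by a zero byte.
big = sample.encode("utf-16-be")

print(repr(little))
print(repr(big))

# Decoding with the matching codec recovers the original text.
print(little.decode("utf-16-le") == sample)
```

Either byte order puts one zero byte next to every ASCII character, which is the pattern shown in the question.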


If this isn't making sense to you, you should read this:

http://www.joelonsoftware.com/articles/Unicode.html

If my guess is right that the file is UTF-16, then you can "fix" it by doing this:



# Untested.
f = open("my_html_file.html", "rb")  # binary mode, so we get the raw bytes
text = f.read().decode("utf-16")  # convert bytes to text
f.close()
data = text.encode("ascii")  # If this fails, try "latin-1" instead
f = open("my_html_file2.html", "wb")  # write bytes back to disk
f.write(data)
f.close()

Once you've inspected the re-written file my_html_file2.html and it is okay to your satisfaction, you can delete the original one.
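If you would rather check the guess before converting anything, here is a rough sketch of a variant. It assumes the file starts with a UTF-16 byte order mark (files saved by Notepad as "Unicode" normally do), and it writes UTF-8 instead of ASCII so that non-ASCII characters don't raise an error. The helper names looks_like_utf16 and rewrite_as_utf8 are made up for this example.

```python
import codecs
import io


def looks_like_utf16(path):
    """Return True if the file begins with a UTF-16 byte order mark."""
    with io.open(path, "rb") as f:
        head = f.read(2)
    return head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def rewrite_as_utf8(src, dst):
    """Read src as UTF-16 text and write the same text to dst as UTF-8."""
    # The "utf-16" codec uses the byte order mark to pick the byte order.
    with io.open(src, "r", encoding="utf-16") as f:
        text = f.read()
    with io.open(dst, "w", encoding="utf-8") as f:
        f.write(text)
```

You would call it along the lines of: if looks_like_utf16("my_html_file.html"), then rewrite_as_utf8("my_html_file.html", "my_html_file2.html"). A file with no byte order mark will make looks_like_utf16 return False even if it really is UTF-16, so treat a False result as "not sure" rather than "no".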



--
Steven
