dave selby wrote:

I split the HTML and print text and I get loads of

\x00T\x00r\x00i\x00a\x00  i.e. I get \x00 breaking up every character.

Any idea what is happening and how to get back to a list of ASCII strings?


How did you generate the HTML file? What other applications have you used to save the document?

Something in the tool chain before it reached Python has saved it using a wide (two-byte) encoding, most likely UTF-16, as that is widely used by Windows and Java. With the right settings, it could take as little as opening the file in Notepad, then clicking Save.

If this isn't making sense to you, you should read this:

http://www.joelonsoftware.com/articles/Unicode.html
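
To see why the zero bytes show up, here is a small sketch of what UTF-16 does to plain ASCII text. The sample string is made up for illustration; it is not taken from your file.

```python
# Illustrative only: "Trial" is an invented sample, not data from the real file.
sample = u"Trial"

# UTF-16 stores each of these characters as two bytes. For ASCII characters
# one byte of each pair is zero; which one depends on the byte order.
big_endian = sample.encode("utf-16-be")     # zero byte comes first
little_endian = sample.encode("utf-16-le")  # zero byte comes second

print(repr(big_endian))
print(repr(little_endian))

# Decoding with the matching codec gives the original text back.
recovered = big_endian.decode("utf-16-be")
print(recovered)
```

A \x00 in front of each letter, as in the output you posted, would be consistent with the big-endian form.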

If my guess is right that the file is UTF-16, then you can "fix" it by doing this:


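# Untested.
f = open("my_html_file.html", "rb")  # binary mode, so we get the raw bytes
# convert bytes to text; if there is no byte order mark at the start of
# the file, you may need "utf-16-be" or "utf-16-le" instead
text = f.read().decode("utf-16")
f.close()
data = text.encode("ascii")  # If this fails, try "latin-1" instead
f = open("my_html_file2.html", "wb")  # write bytes back to disk
f.write(data)
f.close()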

Once you've inspected the rewritten file my_html_file2.html and are satisfied that it is okay, you can delete the original one.
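
If the decode step produces nonsense rather than readable text, the byte order is the first thing to check. Below is a rough sketch of how you might pick a codec, assuming the data really is UTF-16 and mostly ASCII. The name guess_utf16_codec is made up for this example, and the zero-byte test is only a heuristic.

```python
import codecs

def guess_utf16_codec(raw):
    """Guess a codec name for bytes that are assumed to be UTF-16."""
    # A byte order mark (BOM) at the start settles the question; the
    # generic "utf-16" codec reads it and strips it while decoding.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # No BOM: for mostly-ASCII text, a leading zero byte suggests the
    # high byte comes first, i.e. big-endian. This is a guess, not proof.
    if raw[:1] == b"\x00":
        return "utf-16-be"
    return "utf-16-le"

raw = b"\x00T\x00r\x00i\x00a\x00l"  # invented sample in the style of the question
codec = guess_utf16_codec(raw)
text = raw.decode(codec)
print(codec)
print(text)
```

You would pass the result to decode() in place of the hard-coded "utf-16".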


--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org