On 20-Nov-11 12:04, Sarma Tangirala wrote:
Would the HTML parser library in Python be a better idea than using split? That way you have greater control over what is in the HTML.
Absolutely. And it would handle improper HTML (like unmatched brackets) gracefully where the split will just do the wrong thing.
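For illustration, here is a minimal sketch of the parser-based approach, assuming Python 3, where the standard library parser lives in html.parser (in the Python 2 of this thread's era the module was named HTMLParser). The names TextExtractor, extract_text and sample are made up for this example and do not come from the original post.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # hypothetical attribute name, for this sketch only

    def handle_data(self, data):
        # The parser calls this for each run of text between tags.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(html):
    """Return a list of the non-empty text fragments in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks


# Made-up sample document, not the poster's data.
sample = "<html><body><h1>Trial</h1><p>Some <b>bold</b> text</p></body></html>"
text = extract_text(sample)
print(text)                # ['Trial', 'Some', 'bold', 'text']
print(text.index("bold"))  # 2
```

Unlike re.split('<.*?>', HTML), this yields no empty strings for adjacent tags, so text.index(...) can search the fragments directly.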
On 20 Nov 2011 23:58, "dave selby" <dave6...@gmail.com> wrote:

Hi All,

I have a long string which is an HTML file. I strip the HTML tags away and make a list with

text = re.split('<.*?>', HTML)

I then tried to search for a string with text.index(...) but it was not found. Printing HTML to a terminal, I get what I expect: a block of tags and text. When I split the HTML and print text, I get loads of \x00T\x00r\x00i\x00a\x00, i.e. I get \x00 breaking up every character.

Any idea what is happening and how to get back to a list of ASCII strings?

Cheers

Dave

--
Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
--
Steve Willoughby / st...@alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C