On 20-Nov-11 12:04, Sarma Tangirala wrote:
Would the html parser library in python be a better idea as opposed to
using split? That way you have greater control over what is in the html.

Absolutely. It would also handle improper HTML (such as unmatched brackets) gracefully, whereas the split will just do the wrong thing.
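
As a rough sketch of the parser approach, assuming Python 3's standard html.parser module (in Python 2 the module was named HTMLParser). The names TextExtractor and extract_text are made up for illustration and are not from the original posts:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # The parser calls this for each run of text between tags.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(html):
    """Return a list of the non-empty text runs in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks


text = extract_text("<html><body><h1>Trial</h1><p>Some <b>bold</b> text</p></body></html>")
```

The resulting list can be searched with text.index(...) just like the output of re.split, but without the empty strings that the split leaves between adjacent tags.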


On 20 Nov 2011 23:58, "dave selby" <dave6...@gmail.com> wrote:

    Hi All,

    I have a long string containing an HTML file. I strip the HTML tags
    away and make a list with

    text = re.split('<.*?>', HTML)

    I then tried to search for a string with text.index(...) but it was
    not found. Printing HTML to a terminal, I get what I expect: a block
    of tags and text. But when I split the HTML and print text, I get
    loads of

    \x00T\x00r\x00i\x00a\x00  ie I get \x00 breaking up every character.

    Any idea what is happening, and how to get back to a list of ASCII
    strings?

    Cheers

    Dave

    --

    Please avoid sending me Word or PowerPoint attachments.
    See http://www.gnu.org/philosophy/no-word-attachments.html
    _______________________________________________
    Tutor maillist  - Tutor@python.org <mailto:Tutor@python.org>
    To unsubscribe or change subscription options:
    http://mail.python.org/mailman/listinfo/tutor
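
A \x00 next to every character, as in the quoted output, is what ASCII text looks like when it is stored as UTF-16 but handled as single bytes. That is only a likely explanation, since the thread does not say how the file was produced. Under that assumption, a minimal Python 3 sketch would decode the raw bytes before splitting. The function name decode_utf16_html and the sample data are hypothetical:

```python
import re


def decode_utf16_html(raw):
    """Decode bytes that are assumed to hold UTF-16 text."""
    # With a byte order mark, the "utf-16" codec picks the byte order itself.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return raw.decode("utf-16")
    # Without a BOM, guess from where the zero byte of the first
    # ASCII character sits (an assumption that the text starts with ASCII).
    if raw[:1] == b"\x00":
        return raw.decode("utf-16-be")
    return raw.decode("utf-16-le")


# Sample data standing in for the real file contents.
raw = "<p>Trial</p>".encode("utf-16-le")
html = decode_utf16_html(raw)

# Same split as in the original post, with empty strings dropped.
text = [t for t in re.split(r"<.*?>", html) if t]
```

After decoding, text.index("Trial") finds the string, because the list now holds ordinary text rather than bytes interleaved with \x00.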



--
Steve Willoughby / st...@alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C