Mike Meyer wrote:
> It also fails on tags with a ">" in a string in the tag. That's well-formed but ill-used HTML.
> <mike

True enough... however, it doesn't fail too horribly:
>>> striptags("""<sometag attribute = '>'>the text</sometag>""")
"'>the text"
>>>
and I think that case could be rectified rather easily by stripping any leftover content up to the stray '>' from the result, without breaking anything else.
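A minimal sketch of that clean-up step, in Python 3. The striptags() from earlier in the thread is not reproduced in this message, so the version below is a naive stand-in (an assumption) that yields the same output as the session above; strip_residue() is a hypothetical helper name.

```python
import re

# Assumption: a naive stand-in for the thread's striptags(), which is not
# shown in this message. It treats everything from '<' to the first '>' as a tag.
NAIVE_TAG = re.compile(r"<[^>]*>")

def striptags(html):
    """Remove anything that looks like a tag, ignoring quoting rules."""
    return NAIVE_TAG.sub("", html)

def strip_residue(text):
    """Drop everything up to and including the first stray '>' (hypothetical helper)."""
    head, sep, tail = text.partition(">")
    # If there is no '>' at all, partition() returns an empty separator,
    # so the text is handed back unchanged.
    return tail if sep else text

fragment = """<sometag attribute = '>'>the text</sometag>"""
print(repr(striptags(fragment)))                 # "'>the text"
print(repr(strip_residue(striptags(fragment))))  # 'the text'
```

This only looks safe when the stray '>' comes from a tag at the start of the fragment: any genuine text in front of it would be dropped as well, so it probably wants to be applied per fragment rather than to a whole page.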
BTW, I took a first look at BeautifulSoup. As far as I could tell, there is no built-in way to extract text from its parse tree; however, adding one is trivial:
>>> from bsoup import BeautifulSoup, Tag
>>> def extracttext(obj):
...     if isinstance(obj, Tag):
...         return "".join(extracttext(c) for c in obj.contents)
...     else:
...         return str(obj)
...
>>> def bsouptext(text):
...     souptree = BeautifulSoup(text)
...     bodytext = extracttext(souptree.first())
...     text = re.sub(comments, '', bodytext)
...     text = collapsenewlines(text)
...     return text
...
>>> bsouptext("""<sometag attribute = '>'>the text</sometag>""")
"'>the text"
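For comparison, here is a rough sketch of the same "collect the text nodes" idea using only the standard library's html.parser in Python 3. This is a swapped-in technique, not the BeautifulSoup code above; the names TextExtractor and parsertext are hypothetical, and the comment stripping and collapsenewlines steps are left out.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data and ignores tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.chunks.append(data)

def parsertext(markup):
    """Return the concatenated text content of markup (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)

print(repr(parsertext("""<sometag attribute = '>'>the text</sometag>""")))
```

Because the parser tokenizes quoted attribute values, the '>' inside the quotes should not end the tag here, unlike in the regexp and bsoup results shown above.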
On one 'real world test' (nytimes.com), I find the regexp approach to be more accurate, but I won't load up this message with the output to prove it ;-)
Michael
-- http://mail.python.org/mailman/listinfo/python-list