Mike Meyer wrote:


It also fails on tags with a ">" in a string in the tag. That's well-formed but ill-used HTML.

<mike
True enough...however, it doesn't fail too horribly:
>>> striptags("""<sometag attribute = '>'>the text</sometag>""")
"'>the text"
>>>
and I think that case could be rectified rather easily, by stripping any content up to '>' in the result without breaking anything else.
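
A minimal sketch of that fix-up, for illustration only. The striptags below is a hypothetical stand-in (a plain regex substitution that reproduces the output shown above), not the function from earlier in the thread, and LEFTOVER and striptags_fixed are names made up here:

```python
import re

# Assumed stand-in for striptags: remove anything that looks like <...>.
TAG = re.compile(r"<[^>]*>")
# Hypothetical clean-up pattern: text from the start of the result up to
# and including the first stray '>' left behind by a quoted attribute.
LEFTOVER = re.compile(r"^[^<>]*>")

def striptags(text):
    return TAG.sub("", text)

def striptags_fixed(text):
    # Strip tags first, then drop the leading fragment, if there is one.
    return LEFTOVER.sub("", striptags(text))

print(repr(striptags("""<sometag attribute = '>'>the text</sometag>""")))
print(repr(striptags_fixed("""<sometag attribute = '>'>the text</sometag>""")))
```

As written, this only removes a fragment at the very start of the result. A stray '>' in the middle of a document would need a narrower pattern, otherwise the text before it would be removed as well.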


BTW, I took a first look at BeautifulSoup. As far as I could tell, there is no
built-in way to extract text from its parse tree; however, adding one is trivial:

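 >>> import re
 >>> from bsoup import BeautifulSoup, Tag
 ...
 >>> # comments and collapsenewlines are not defined in this message;
 >>> # they are assumed to come from the regexp version earlier in the thread.
 >>> def extracttext(obj):
 ...     # Recurse into tags; anything else (a text node) is converted with str().
 ...     if isinstance(obj, Tag):
 ...         return "".join(extracttext(c) for c in obj.contents)
 ...     else:
 ...         return str(obj)
 ...
 >>> def bsouptext(text):
 ...     souptree = BeautifulSoup(text)
 ...     bodytext = extracttext(souptree.first())
 ...     text = re.sub(comments, '', bodytext)
 ...     text = collapsenewlines(text)
 ...     return text
 ...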

 >>> bsouptext("""<sometag attribute = '>'>the text</sometag>""")
 "'>the text"

On one 'real world test' (nytimes.com), I find the regexp approach to be more accurate, but I won't load up this message with the output to prove it ;-)

Michael


-- http://mail.python.org/mailman/listinfo/python-list
