Michael Spencer <[EMAIL PROTECTED]> writes:
> gf gf wrote:
>> [wants to extract ASCII from badly-formed HTML and thinks BeautifulSoup is
>> too complex]
>
> You haven't specified what you mean by "extracting" ASCII, but I'll
> assume that you want to start by eliminating html tags and comments,
> which is easy enough with a couple of regular expressions:
>
> >>> import re
> >>> comments = re.compile('<!--.*?-->', re.DOTALL)
> >>> tags = re.compile('<.*?>', re.DOTALL)
> ...
> >>> def striptags(text):
> ...     text = re.sub(comments,'', text)
> ...     text = re.sub(tags,'', text)
> ...     return text
> ...
> >>> def collapsenewlines(text):
> ...     return "\n".join(line for line in text.splitlines() if line)
> ...
> >>> import urllib2
> >>> f = urllib2.urlopen('http://www.python.org/')
> >>> source = f.read()
> >>> text = collapsenewlines(striptags(source))
> >>>
>
> This will of course fail if there is a "<" without a ">", probably in
> other cases too. But it is indifferent to whether the html is
> well-formed.

It also fails on tags with a ">" in a string in the tag. That's
well-formed but ill-used HTML.

        <mike
-- 
Mike Meyer <[EMAIL PROTECTED]>          http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
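
To make the ">" failure concrete, here is a minimal sketch, not anything posted in the thread. It assumes Python 3 (the quoted session used Python 2), reuses the two regular expressions from the quoted message, and compares them with the standard library's html.parser on a made-up sample string. TextExtractor and striptags_parsed are hypothetical names introduced only for this illustration.

```python
import re
from html.parser import HTMLParser

# Same two patterns as in the quoted message.
comments = re.compile(r'<!--.*?-->', re.DOTALL)
tags = re.compile(r'<.*?>', re.DOTALL)

def striptags(text):
    """Regex approach from the quoted message."""
    text = comments.sub('', text)
    return tags.sub('', text)

class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps character data, ignores tags.

    Comments are dropped too, because handle_comment is not overridden.
    """
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def striptags_parsed(text):
    """Parser-based alternative (an assumption, not the thread's method)."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return ''.join(parser.chunks)

# Made-up sample: a ">" inside a quoted attribute value.
sample = '<p><img alt="a > b" src="x.png">hello</p>'

regex_result = striptags(sample)          # the tag is cut at the first ">"
parsed_result = striptags_parsed(sample)  # the quoted ">" stays inside the tag

print(repr(regex_result))
print(repr(parsed_result))
```

The regex stops the img tag at the ">" inside the alt string, so the rest of the tag leaks into the output as text. The parser treats the quoted value as part of the tag and returns only "hello". html.parser is also fairly tolerant of badly-formed markup, though it makes no attempt to repair it.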