Re: Converting HTML to ASCII

Mike Meyer Fri, 25 Feb 2005 13:15:05 -0800

Michael Spencer <[EMAIL PROTECTED]> writes:

> gf gf wrote:
>> [wants to extract ASCII from badly-formed HTML and thinks BeautifulSoup is 
>> too complex]
>
> You haven't specified what you mean by "extracting" ASCII, but I'll
> assume that you want to start by eliminating html tags and comments,
> which is easy enough with a couple of regular expressions:
>
>   >>> import re
>   >>> comments = re.compile('<!--.*?-->', re.DOTALL)
>   >>> tags = re.compile('<.*?>', re.DOTALL)
>   >>>
>   >>> def striptags(text):
>   ...     text = re.sub(comments,'', text)
>   ...     text = re.sub(tags,'', text)
>   ...     return text
>   ...
>   >>> def collapsenewlines(text):
>   ...     return "\n".join(line for line in text.splitlines() if line)
>   ...
>   >>> import urllib2
>   >>> f = urllib2.urlopen('http://www.python.org/')
>   >>> source = f.read()
>   >>> text = collapsenewlines(striptags(source))
>   >>>
>
> This will of course fail if there is a "<" without a ">", probably in
> other cases too.  But it is indifferent to whether the html is
> well-formed.


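It also fails on tags that contain a ">" inside a quoted attribute
value: the non-greedy match stops at that ">" and leaves the rest of
the tag in the output. That's well-formed but ill-used HTML.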
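
To make that failure concrete, here is a minimal sketch in Python 3. The
names strip_naive, strip_quote_aware and the sample string are
hypothetical, made up for illustration. The first pattern is the one from
the quoted message; the second is one possible variant that skips over
quoted strings inside a tag. It is an assumption about how one might
patch the regex, not a substitute for a real parser, and it will still
trip on comments, script blocks, or unterminated quotes.

```python
import re

# Pattern from the quoted message: stops at the first ">" it sees.
naive_tags = re.compile(r'<.*?>', re.DOTALL)

# Hypothetical variant: inside a tag, consume whole quoted strings
# ("..." or '...') so a ">" inside them does not end the match.
quote_aware_tags = re.compile(r'''<(?:"[^"]*"|'[^']*'|[^'">])*>''')


def strip_naive(text):
    """Remove tags using the naive non-greedy pattern."""
    return naive_tags.sub('', text)


def strip_quote_aware(text):
    """Remove tags, treating quoted attribute values as opaque."""
    return quote_aware_tags.sub('', text)


# Illustrative input: a ">" inside a quoted attribute value.
sample = '<img alt="a > b" src="x.png">text'

print(repr(strip_naive(sample)))        # tag residue leaks into the output
print(repr(strip_quote_aware(sample)))  # only the text after the tag remains
```

With the naive pattern the match ends at the ">" inside the alt value, so
everything after it in the tag is left behind as if it were text.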

            <mike
-- 
Mike Meyer <[EMAIL PROTECTED]>                  http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
-- 
http://mail.python.org/mailman/listinfo/python-list
