On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> > 200-modules PyXML package installed. And you don't want the 75Kb
> > BeautifulSoup?
>
> I wasn't aware that I had PyXML installed, and can't find a reference
> to having it installed in pydocs. ...
Ugh. Found it. Sorry about that, but I still don't understand why there
isn't a simple way to do this without using PyXML, BeautifulSoup or
libxml2dom. What's the point in having sgmllib, htmllib, HTMLParser, and
formatter all built in if I have to use someone else's modules to write
a couple of lines of code that achieve the simple thing I want? I get
the feeling that this would be easier if I just broke down and wrote a
couple of regular expressions, but it hardly seems a 'pythonic' way of
going about things.

# get the source (assuming you don't have it locally and have an
# internet connection)
>>> import urllib
>>> page = urllib.urlopen("http://diveintopython.org/")
>>> source = page.read()
>>> page.close()

# set up some regexes to find tags, strip them out, and correct some
# formatting oddities
>>> import re
>>> p = re.compile(r'(<p.*?>.*?</p>)', re.DOTALL)
>>> tag_strip = re.compile(r'>(.*?)<', re.DOTALL)
>>> fix_format = re.compile(r'\n +', re.MULTILINE)

# achieve clean results.
>>> paragraphs = re.findall(p, source)
>>> text_list = re.findall(tag_strip, paragraphs[5])
>>> text = "".join(text_list)
>>> clean_text = re.sub(fix_format, " ", text)

This works, and is small and easily reproduced, but it seems like it
would break easily, and it seems a waste of the other *ML-specific
parsers that ship with Python.
--
http://mail.python.org/mailman/listinfo/python-list
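For what it's worth, the built-in HTMLParser really can do this without PyXML, BeautifulSoup, or regexes. Below is a minimal sketch (not the poster's code): a subclass that collects the text inside each <p> element and collapses runs of whitespace, much like the fix_format regex above. It is written in the Python 3 spelling (`html.parser`); under Python 2 the module is simply named `HTMLParser`, and the page source would be fetched with `urllib.urlopen` as in the session above. The sample HTML fed to it at the end is made up for illustration.

```python
from html.parser import HTMLParser  # module is named "HTMLParser" in Python 2


class ParagraphExtractor(HTMLParser):
    """Collect the plain-text content of every <p> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.paragraphs = []  # finished, cleaned paragraphs
        self._depth = 0       # how many <p> tags we are currently inside
        self._chunks = []     # text pieces of the paragraph being built

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # collapse newlines and runs of spaces to single spaces,
                # like the fix_format regex in the session above
                text = " ".join("".join(self._chunks).split())
                self.paragraphs.append(text)
                self._chunks = []

    def handle_data(self, data):
        if self._depth:  # only keep text that sits inside a <p>
            self._chunks.append(data)


parser = ParagraphExtractor()
parser.feed('<html><body><p>Dive <em>into</em>\n    Python</p></body></html>')
print(parser.paragraphs[0])  # -> Dive into Python
```

The nesting counter is what the regex version cannot easily replicate: the parser tracks real tag structure, so markup inside a paragraph (like the <em> above) is stripped correctly instead of hoping that '>(.*?)<' lands on tag boundaries.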