On 20 Aug, 15:44, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> ----------------------------------------------------------
> f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
> parser = htmllib.HTMLParser(f)
> parser.feed(html)
> parser.close()
> return parser.anchorlist
> ----------------------------------------------------------

The htmllib.HTMLParser class is hard to use. I would replace those
lines with:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag=="a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)

parser = MyHTMLParser()
parser.feed(htmltext)
print parser.anchorlist

The anchorlist attribute, which I define here, is a list containing
all href attributes found in the page.
See <http://docs.python.org/lib/module-HTMLParser.html>

> I get the idea that we're allocating some memory that looks like a
> file so formatter.dumbwriter can manipulate it. The results are
> passed to formatter.abstractformatter which does something else to the
> HTML code. The results are then passed to "f" which is then passed to
> htmllib.HTMLParser so it can parse the html for links. I guess I
> don't understand with any great detail as to why this is happening.
> I know someone is going to say that I should RTFM so here is the gist
> of the documentation:

Don't even try to understand it - it's a mess. Use the HTMLParser
module instead.

> The last question is.. I can't find any documentation to explain
> where the "anchorlist" attribute came from? Here is the only
> reference to this attribute that I can find anywhere in the Python
> documentation.

And that's all you will find.

> So .. How does an average developer figure out that parser returns a
> list of hyperlinks in an attribute called anchorlist? Is this

Usually, those attributes are hyperlinked and you can find them in the
documentation index.
Not for this one :(

> something that you just "figure out" or is there some book I should be
> reading that documents all of the attributes for a particular
> method? It just seems a bit obscure and certainly not something I
> would have figured out on my own. Does this make me a poor developer
> who should find another hobby? I just need to know if there is
> something wrong with me or if this is a reasonable question to ask.

It's a very reasonable question. The attribute should be documented
properly. But the class itself is a bit old; I never use it anymore.

> The last question I have is about debugging. The spider is capable
> of parsing links until it reaches:
>
> "html = get_page(http://www.google.com/jobs/fortune)" which returns
> the contents of a pdf document, assigns the pdf contents to html which
> is later passed to parser.feed(html) which crashes.

You can verify the Content-Type header before processing. Quoting the
get_page method:

> def get_page(url, log):
>     """Retrieve URL and return comments, log errors."""
>     try:
>         page = urllib2.urlopen(url)
>     except urllib2.URLError:
>         log("Error retrieving: " + url)
>         return ''
>     body = page.read()
>     page.close()
>     return body

From <http://docs.python.org/lib/module-urllib2.html>, the urlopen
function returns a file-like object, which has an additional info()
method returning the response headers. You can get the Content-Type
using page.info().gettype(), which should be text/html or
application/xhtml+xml. For any other type, just return '' as you do
for any error.

-- 
Gabriel Genellina
-- 
http://mail.python.org/mailman/listinfo/python-list
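
For reference, here is a minimal sketch of the Content-Type check described above. It is written for Python 3, where urllib2 became urllib.request and gettype() became get_content_type(), so it is a translation of the idea and not the code from the original spider. The HTML_TYPES set, the opener parameter, and the decoding step are assumptions added for illustration.

```python
import urllib.error
import urllib.request

# Content types accepted as HTML; this set is an assumption for the sketch.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def get_page(url, log, opener=urllib.request.urlopen):
    """Retrieve URL and return its text, or '' on error or non-HTML content.

    `opener` is a hypothetical hook (not in the original get_page) that
    lets the function be exercised without network access.
    """
    try:
        page = opener(url)
    except urllib.error.URLError:
        log("Error retrieving: " + url)
        return ''
    try:
        headers = page.info()  # response headers of the file-like object
        content_type = headers.get_content_type()
        if content_type not in HTML_TYPES:
            # PDFs, images and so on are skipped instead of being parsed.
            log("Skipping %s (%s)" % (url, content_type))
            return ''
        # Python 3 reads bytes; fall back to UTF-8 if no charset is declared.
        charset = headers.get_content_charset() or "utf-8"
        return page.read().decode(charset, "replace")
    finally:
        page.close()
```

Called as get_page(url, log), it behaves like the quoted function, except that a PDF or any other non-HTML response comes back as '' and is never fed to the parser.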