On Wednesday, 5 September 2012 19:23:45 UTC+2, BobAalsma wrote:
> On Wednesday, 5 September 2012 14:57:05 UTC+2, BobAalsma wrote:
> > I'm trying to understand the HTMLParser so I've copied some code from
> > http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser
> > and tried that on my LinkedIn page.
> >
> > No errors, but some of the tags seem to go missing for no apparent reason -
> > any advice?
> >
> > I have searched extensively for this, but seem to be the only one with
> > missing data from HTMLParser :(
> >
> > Code:
> >
> > import urllib2
> > from HTMLParser import HTMLParser
> >
> > from GetHttpFileContents import getHttpFileContents
> >
> > # create a subclass and override the handler methods
> > class MyHTMLParser(HTMLParser):
> >     def handle_starttag(self, tag, attrs):
> >         print "Start tag:\n\t", tag
> >         for attr in attrs:
> >             print "\t\tattr:", attr
> >         # end for attr in attrs:
> >     #
> >     def handle_endtag(self, tag):
> >         print "End tag :\n\t", tag
> >     #
> >     def handle_data(self, data):
> >         if data != '\n\n':
> >             if data != '\n':
> >                 print "Data :\t\t", data
> >             # end if 1
> >         # end if 2
> >     #
> > #
> > # ---------------------------------------------------------------------
> > #
> > def removeHtmlFromFileContents():
> >     TextOut = ''
> >
> >     parser = MyHTMLParser()
> >     parser.feed(urllib2.urlopen('http://nl.linkedin.com/in/bobaalsma').read())
> >
> >     return TextOut
> > #
> > # ---------------------------------------------------------------------
> > #
> > if __name__ == '__main__':
> >     TextOut = removeHtmlFromFileContents()
> >
> > Part of the output:
> >
> > End tag :
> >     script
> > Start tag:
> >     title
> > Data :      Bob Aalsma - Nederland | LinkedIn
> > End tag :
> >     title
> > Start tag:
> >     script
> >         attr: ('type', 'text/javascript')
> >         attr: ('src', 'http://www.linkedin.com/uas/authping?url=http%3A%2F%2Fnl%2Elinkedin%2Ecom%2Fin%2Fbobaalsma')
> > End tag :
> >     script
> > Start tag:
> >     link
> >         attr: ('rel', 'stylesheet')
> >         attr: ('type', 'text/css')
> >         attr: ('href', 'http://s3.licdn.com/scds/concat/common/css?h=5v4lkweptdvona6w56qelodrj-7pfvsr76gzb22ys278pbj80xm-b1io9ndljf1bvpack85gyxhv4-5xxmkfcm1ny97biv0pwj7ch69')
> > Start tag:
> >     script
> >         attr: ('type', 'text/javascript')
> >         attr: ('src', 'http://s4.licdn.com/scds/concat/common/js?h=7nhn6ycbvnz80dydsu88wbuk-1kjdwxpxv0c3z97afuz9dlr9g-dlsf699o6xkxgppoxivctlunb-8v6o0480wy5u6j7f3sh92hzxo')
> > End tag :
> >     script
> > End tag :
> >     head
> >
> > But the source text for this is [and all of the "<meta ...>" tags seem to go
> > missing]:
> >
> > </script>
> > <title>Bob Aalsma | LinkedIn</title>
> > <link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
> > <link rel="stylesheet" type="text/css" href="https://s4-s.licdn.com/scds/concat/common/css?h=b1io9ndljf1bvpack85gyxhv4-6qrj4gxbwq8loasfnyfmyuphe-dhog2e5h8scik4whkpqccnzou-dmo1gwj6nlhvdvzx7rmluambv-69sgyia02rmcjmco0t9d3xpvo">
> > <meta name="LinkedInBookmarkType" content="profile">
> > <meta name="ShortTitle" content="Bob Aalsma">
> > <meta name="Description" content="Bob Aalsma: Project Manager at DripFeed in the Information Services industry (Amsterdam Area, Netherlands)">
> > <meta name="UniqueID" content="24198692">
> > <meta name="SaveURL" content="/profile/view?id=24198692&authType=name&authToken=KhOG">
> > </head>
>
> Hmm, OK, Peter, thanks. I didn't consider the effect of logging in, that
> could certainly be a reason. So how could I have the script log in?
>
> [Didn't understand the bit about the kittens, though. How about that?]
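
One way to check whether the parser itself drops the <meta> tags, or whether the script simply receives different HTML than the browser does (for instance because the browser is logged in), is to feed the parser a fragment of the browser-side source directly. What follows is only a minimal sketch, assuming Python 3, where the module is html.parser rather than the Python 2 HTMLParser module used in the post. TagCollector, SAMPLE and meta_names are names made up for this illustration.

```python
# Python 3 sketch (assumption: the original post used Python 2's HTMLParser).
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, attrs_dict) in document order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


# Fragment taken from the browser-side source quoted in the post.
SAMPLE = """
<title>Bob Aalsma | LinkedIn</title>
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="UniqueID" content="24198692">
</head>
"""


def meta_names(html_text):
    """Return the name attribute of every <meta> tag the parser reports."""
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    return [attrs.get("name") for tag, attrs in collector.tags if tag == "meta"]


if __name__ == "__main__":
    print(meta_names(SAMPLE))
```

If the <meta> names show up here, the parser is probably fine and the difference is more likely in what the server sends to the script. Saving the downloaded page to a file and comparing it with the browser's "view source" would be one way to confirm that.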
Oops, sorry, found that bit about logging in - asked too soon; still wonder about the kittens ;)

-- 
http://mail.python.org/mailman/listinfo/python-list