On Wednesday, 5 September 2012 19:23:45 UTC+2, BobAalsma wrote:
> On Wednesday, 5 September 2012 14:57:05 UTC+2, BobAalsma wrote:
> > I'm trying to understand the HTMLParser so I've copied some code from
> > http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser
> > and tried that on my LinkedIn page.
> >
> > No errors, but some of the tags seem to go missing for no apparent reason -
> > any advice?
> >
> > I have searched extensively for this, but seem to be the only one with
> > missing data from HTMLParser :(
> >
> > Code:
> >
> > import urllib2
> > from HTMLParser import HTMLParser
> >
> > from GetHttpFileContents import getHttpFileContents
> >
> > # create a subclass and override the handler methods
> > class MyHTMLParser(HTMLParser):
> >     def handle_starttag(self, tag, attrs):
> >         print "Start tag:\n\t", tag
> >         for attr in attrs:
> >             print "\t\tattr:", attr
> >         # end for attr in attrs:
> >     #
> >     def handle_endtag(self, tag):
> >         print "End tag :\n\t", tag
> >     #
> >     def handle_data(self, data):
> >         if data != '\n\n':
> >             if data != '\n':
> >                 print "Data :\t\t", data
> >             # end if 1
> >         # end if 2
> >     #
> > #
> > # ---------------------------------------------------------------------
> > #
> > def removeHtmlFromFileContents():
> >     TextOut = ''
> >
> >     parser = MyHTMLParser()
> >     parser.feed(urllib2.urlopen('http://nl.linkedin.com/in/bobaalsma').read())
> >
> >     return TextOut
> > #
> > # ---------------------------------------------------------------------
> > #
> > if __name__ == '__main__':
> >     TextOut = removeHtmlFromFileContents()
> >
> > Part of the output:
> >
> > End tag :
> >     script
> > Start tag:
> >     title
> > Data :              Bob Aalsma - Nederland | LinkedIn
> > End tag :
> >     title
> > Start tag:
> >     script
> >             attr: ('type', 'text/javascript')
> >             attr: ('src', 'http://www.linkedin.com/uas/authping?url=http%3A%2F%2Fnl%2Elinkedin%2Ecom%2Fin%2Fbobaalsma')
> > End tag :
> >     script
> > Start tag:
> >     link
> >             attr: ('rel', 'stylesheet')
> >             attr: ('type', 'text/css')
> >             attr: ('href', 'http://s3.licdn.com/scds/concat/common/css?h=5v4lkweptdvona6w56qelodrj-7pfvsr76gzb22ys278pbj80xm-b1io9ndljf1bvpack85gyxhv4-5xxmkfcm1ny97biv0pwj7ch69')
> > Start tag:
> >     script
> >             attr: ('type', 'text/javascript')
> >             attr: ('src', 'http://s4.licdn.com/scds/concat/common/js?h=7nhn6ycbvnz80dydsu88wbuk-1kjdwxpxv0c3z97afuz9dlr9g-dlsf699o6xkxgppoxivctlunb-8v6o0480wy5u6j7f3sh92hzxo')
> > End tag :
> >     script
> > End tag :
> >     head
> >
> > But the source text for this is as follows (all of the "<meta ...>" tags
> > seem to go missing):
> >
> > </script>
> > <title>Bob Aalsma | LinkedIn</title>
> > <link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
> > <link rel="stylesheet" type="text/css" href="https://s4-s.licdn.com/scds/concat/common/css?h=b1io9ndljf1bvpack85gyxhv4-6qrj4gxbwq8loasfnyfmyuphe-dhog2e5h8scik4whkpqccnzou-dmo1gwj6nlhvdvzx7rmluambv-69sgyia02rmcjmco0t9d3xpvo">
> > <meta name="LinkedInBookmarkType" content="profile">
> > <meta name="ShortTitle" content="Bob Aalsma">
> > <meta name="Description" content="Bob Aalsma: Project Manager at DripFeed in the Information Services industry (Amsterdam Area, Netherlands)">
> > <meta name="UniqueID" content="24198692">
> > <meta name="SaveURL" content="/profile/view?id=24198692&amp;authType=name&amp;authToken=KhOG">
> > </head>
>
> Hmm, OK, Peter, thanks. I didn't consider the effect of logging in; that
> could certainly be a reason. So how could I have the script log in?
>
> [Didn't understand the bit about the kittens, though. How about that?]

Oops, sorry, found that bit about logging in - asked too soon; still wonder 
about the kittens ;)
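
For the archive, a minimal sketch suggesting that the parser itself does
report <meta> tags when they are present in its input. It is written for
Python 3, so the import is html.parser rather than the Python 2 HTMLParser
module used in the quoted code. TagCollector and SNIPPET are made-up names
for illustration only, and the snippet is a shortened copy of the <head>
fragment quoted above. If this holds, the missing tags were probably never
in the page the script downloaded, which would fit the logged-in versus
not-logged-in explanation.

```python
from html.parser import HTMLParser  # Python 3 name of the Python 2 HTMLParser module


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.start_tags.append((tag, dict(attrs)))


# Shortened copy of the <head> fragment quoted in the post (an assumption:
# this is what the browser showed, not necessarily what the script fetched)
SNIPPET = """
<title>Bob Aalsma | LinkedIn</title>
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="UniqueID" content="24198692">
</head>
"""

collector = TagCollector()
collector.feed(SNIPPET)
collector.close()

# Names of all <meta> tags the parser reported
meta_names = [attrs.get("name") for tag, attrs in collector.start_tags if tag == "meta"]
print(meta_names)
```

Running the same collector over the string returned by the download call,
instead of SNIPPET, should show whether the fetched page contains any
<meta> tags at all.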
-- 
http://mail.python.org/mailman/listinfo/python-list
