On 28/05/14 11:42, Mitesh H. Budhabhatti wrote:
Hello Friends,

I am using Python 3.3.3 on Windows 7.  I would like to know what is the
best method to do HTML parsing?  For example, I want to connect to
www.yahoo.com and get all the tags and their values.

The standard library contains a parser module:
html.parser

Which can do what you want, although it's a non-trivial exercise.
Basically you subclass HTMLParser and define event handler methods
for each type of parser event. In your case you need handlers for
starttag and data, and maybe endtag.

Within the starttag handler you can check the tag name to determine
the tag type (the attributes arrive as a list of (name, value) pairs),
so it typically looks like

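from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'p':
            pass  # process paragraph tag
        elif tag == 'tr':
            pass  # process table row
        # etc...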
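
To show how the three handlers fit together, here is a rough sketch that records every start tag with its attributes and the text directly inside it. The names TagCollector, VOID_TAGS and SAMPLE_HTML are my own placeholders, not part of the library, and the sample string stands in for a real page.

```python
from html.parser import HTMLParser

# Tags that never have an end tag; this short list is an assumption
# for the sketch, not the complete HTML list.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagCollector(HTMLParser):
    """Records each start tag, its attributes and the text directly inside it."""

    def __init__(self):
        super().__init__()
        self.records = []  # one [tag, attrs_dict, text] entry per start tag
        self._open = []    # stack of records for tags that are still open

    def handle_starttag(self, tag, attrs):
        record = [tag, dict(attrs), ""]
        self.records.append(record)
        if tag not in VOID_TAGS:
            self._open.append(record)

    def handle_data(self, data):
        # Attach text to the innermost open tag only
        if self._open:
            self._open[-1][2] += data

    def handle_endtag(self, tag):
        # Pop back to the matching start tag, tolerating unclosed tags
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break


SAMPLE_HTML = ('<p class="intro">Hello <b>world</b></p>'
               '<table><tr><td>cell</td></tr></table>')

parser = TagCollector()
parser.feed(SAMPLE_HTML)
parser.close()
for tag, attrs, text in parser.records:
    print(tag, attrs, repr(text))
```

For a live page you would first fetch the HTML (for example with urllib.request.urlopen(), decoding the bytes it returns) and pass the resulting string to feed() in place of SAMPLE_HTML.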


However, you might find it easier to use BeautifulSoup, which is a
third-party package you need to download. BeautifulSoup tends to handle
badly formed HTML better than the standard parser and works by
reading the whole HTML document into a tree-like structure which
you can access, search or traverse...
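
For comparison, a minimal BeautifulSoup sketch of the same idea, assuming the beautifulsoup4 package is installed (pip install beautifulsoup4). The function name list_tags and the sample string are placeholders of mine; I pass "html.parser" as the underlying parser so nothing beyond bs4 itself is needed.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

SAMPLE_HTML = ('<html><body><p class="intro">Hello <b>world</b></p>'
               '<table><tr><td>cell</td></tr></table></body></html>')


def list_tags(html):
    """Return (name, attributes, text) for every tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) matches every tag in the tree
    return [(tag.name, dict(tag.attrs), tag.get_text())
            for tag in soup.find_all(True)]


for name, attrs, text in list_tags(SAMPLE_HTML):
    print(name, attrs, repr(text))
```

Note that get_text() returns all the text beneath a tag, including nested tags, so the outer tags repeat the text of their children.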

HTH
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor