Re: [Tutor] Titles from a web page

2011-05-05 Thread Alan Gauld


louis leichtnam l.leicht...@gmail.com wrote

 I'm trying to write a program that looks in a webpage to find the titles of
 a subsection of the page:

 Can you help me out? I tried using regular expressions but I keep hitting
 walls and I don't know what to do...


Regular expressions are the wrong tool for parsing HTML unless
you are searching for something very simple.

There is an HTML parser in the Python standard library (*) that you
can use if the HTML is reasonably well formed. If it's sloppy you
would be better off with something like BeautifulSoup or lxml.

If the page is written in XHTML then you could also use the
ElementTree module, which is designed for XML parsing.
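
To make the XHTML route concrete, here is a minimal sketch using xml.etree.ElementTree from the standard library. The sample page, the id="world" section, the use of h2 for story titles and the function name are all assumptions made up for illustration; a real page would need its own element names, and XHTML that uses named entities or is not well formed will not parse this way.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so searches must be qualified with it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page: each section is a <div> with an id, each title an <h2>.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div id="sports"><h2>Cup final tonight</h2></div>
<div id="world">
<h2><a href="/story1">Storm hits coast</a></h2>
<h2><a href="/story2">Election results</a></h2>
</div>
</body>
</html>"""


def xhtml_section_titles(text, section_id):
    """Return the text of every <h2> inside the <div> with the given id."""
    root = ET.fromstring(text)
    section = root.find(".//x:div[@id='%s']" % section_id, XHTML_NS)
    if section is None:
        return []
    # itertext() gathers the text of nested elements such as the <a> links.
    return ["".join(h2.itertext()).strip()
            for h2 in section.iterfind(".//x:h2", XHTML_NS)]


print(xhtml_section_titles(SAMPLE_XHTML, "world"))
```

In practice the text would come from the downloaded page rather than a string literal.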

(*) In fact there are two: htmllib and HTMLParser. The former is more
powerful but more complex. Some brief examples can be found
in my tutorial here:

http://www.alan-g.me.uk/tutor/tutwebc.htm

Note that the topic is not complete; the last few sections are
placeholders only.
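
As a rough illustration of the HTMLParser approach mentioned above, here is a sketch that collects the titles from one section of a page. It assumes Python 3, where the module is html.parser (in Python 2 it is HTMLParser). The class name, the sample HTML, and the idea that each section is a div with an id and each title an h2 are hypothetical; adjust them to whatever the real page uses.

```python
from html.parser import HTMLParser


class SectionTitleParser(HTMLParser):
    """Collect the text of title tags found inside one <div id=...> section."""

    def __init__(self, section_id, title_tag="h2"):
        super().__init__()
        self.section_id = section_id
        self.title_tag = title_tag
        self.depth = 0         # <div> nesting depth inside the section; 0 = outside
        self.in_title = False  # True while between <h2> and </h2> in the section
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1            # nested div inside the section
            elif dict(attrs).get("id") == self.section_id:
                self.depth = 1             # entering the section itself
        elif self.depth and tag == self.title_tag:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == self.title_tag:
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


# Hypothetical page used only for this example.
SAMPLE = """
<div id="sports"><h2>Cup final tonight</h2></div>
<div id="world">
  <h2><a href="/story1">Storm hits coast</a></h2>
  <div class="teaser"><p>Details inside.</p></div>
  <h2><a href="/story2">Election results</a></h2>
</div>
"""


def section_titles(html, section_id):
    parser = SectionTitleParser(section_id)
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


print(section_titles(SAMPLE, "world"))
```

The parser only tracks div nesting, so it relies on the div tags being balanced; for sloppier markup one of the third party parsers is probably the better choice.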

HTH,

Alan G. 



___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


[Tutor] Titles from a web page

2011-05-04 Thread louis leichtnam
Hello Everyone,

I'm trying to write a program that looks in a webpage to find the titles of
a subsection of the page:

For example, a newspaper site lists the titles of stories under the section
World, and when you click on one you get the entire story.

I want a program that prints only the titles in this particular section of
the page.

Can you help me out? I tried using regular expressions but I keep hitting
walls and I don't know what to do...

Thank you


Re: [Tutor] Titles from a web page

2011-05-04 Thread Modulok
You might look into the third party module, 'BeautifulSoup'. It's designed to
help you interrogate markup (even poor markup), extracting nuggets of data based
on various criteria.

-Modulok-




Re: [Tutor] Titles from a web page

2011-05-04 Thread James Mills
On Thu, May 5, 2011 at 1:52 PM, Modulok modu...@gmail.com wrote:
 You might look into the third party module, 'BeautifulSoup'. It's designed to
 help you interrogate markup (even poor markup), extracting nuggets of data
 based on various criteria.

lxml is also worth looking into; it provides similar functionality.


-- 
-- James Mills
--
-- Problems are solved by method


Re: [Tutor] Titles from a web page

2011-05-04 Thread Michiel Overtoom

On May 5, 2011, at 07:16, James Mills wrote:

 On Thu, May 5, 2011 at 1:52 PM, Modulok modu...@gmail.com wrote:
 You might look into the third party module, 'BeautifulSoup'. It's designed to
 help you interrogate markup (even poor markup), extracting nuggets of data
 based on various criteria.

 lxml is also worth looking into; it provides similar functionality.

For especially broken markup you might even consider version 3.0.7a of
BeautifulSoup. The parser in later versions got slightly less forgiving.

Greetings,

-- 
Control over the use of one's ideas really constitutes control over other 
people's lives; and it is usually used to make their lives more difficult. - 
Richard Stallman
