Re: [Tutor] Titles from a web page

2011-05-05 Thread Alan Gauld



"louis leichtnam"  wrote

> I'm trying to write a program that looks in a webpage to find the
> titles of a subsection of the page:
>
> Can you help me out? I tried using regular expressions but I keep
> hitting walls and I don't know what to do...


Regular expressions are the wrong tool for parsing HTML unless
you are searching for something very simple.
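
To illustrate the point, here is a small sketch of how a simple pattern
goes wrong. The HTML fragment and the pattern are made up for this
example; they are not taken from the original poster's page.

```python
import re

# Hypothetical page fragment: the second heading carries an attribute
# and spans two lines, both of which are perfectly legal HTML.
html = '''<h2>First story</h2>
<h2 class="lead">Second
story</h2>'''

# A naive pattern that expects a bare <h2> tag on a single line.
naive = re.compile(r'<h2>(.*)</h2>')

found = naive.findall(html)
print(found)  # only the first heading is matched
```

The pattern can be patched to cope with attributes and line breaks, but
each patch tends to uncover another case (nested tags, comments,
different quoting), which is where a real parser earns its keep.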

There is an HTML parser in the Python standard library (*) that you
can use if the HTML is reasonably well formed. If it's sloppy you
would be better off with something like BeautifulSoup or lxml.

If the page is written in XHTML then you could also use the
ElementTree module (xml.etree.ElementTree), which is designed for
XML parsing.
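
For the XHTML case, a minimal sketch with xml.etree.ElementTree might
look like the following. The sample document, the "world" id and the use
of <h2> for story titles are assumptions made for illustration; a real
page would need its own markup inspected first, and it must be
well-formed XML for this to work at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML with two sections.
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="sport"><h2>Cup final tonight</h2></div>
    <div id="world">
      <h2>Election results</h2>
      <h2>Storm <em>warning</em></h2>
    </div>
  </body>
</html>"""

# XHTML elements live in a namespace, so queries need a prefix mapping.
ns = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(xhtml)

# Locate the section of interest, then collect the headings inside it.
section = root.find(".//x:div[@id='world']", ns)
titles = ["".join(h.itertext()).strip()
          for h in section.iterfind(".//x:h2", ns)]

print(titles)
```

itertext() is used rather than .text so that headings containing nested
markup, such as the <em> above, still yield their full text.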

(*) In fact there are two: htmllib and HTMLParser. The former is more
powerful but more complex. Some brief examples can be found
in my tutorial here:

http://www.alan-g.me.uk/tutor/tutwebc.htm

Note that the topic is not complete; the last few sections are
placeholders only.
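
As a rough illustration of the standard-library parser approach, here is
a sketch written for Python 3, where the HTMLParser module is available
as html.parser (htmllib no longer exists there). The class name
TitleCollector, the sample markup and the assumption that titles are
<h2> elements inside <div id="world"> are all hypothetical; adjust them
to whatever the real page uses.

```python
from html.parser import HTMLParser  # the module is named HTMLParser in Python 2


class TitleCollector(HTMLParser):
    """Collect the text of every <h2> found inside <div id="world">."""

    def __init__(self):
        super().__init__()
        self.section_depth = 0    # > 0 while inside the target <div>
        self.in_heading = False   # True while inside an <h2> of interest
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Enter the target section, or track a nested <div> within it.
            if self.section_depth or dict(attrs).get("id") == "world":
                self.section_depth += 1
        elif tag == "h2" and self.section_depth:
            self.in_heading = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.section_depth:
            self.section_depth -= 1
        elif tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current title.
        if self.in_heading:
            self.titles[-1] += data


# Hypothetical page fragment with two sections.
sample = """
<div id="sport"><h2>Cup final tonight</h2></div>
<div id="world">
  <h2><a href="/s/1">Election results</a></h2>
  <div class="teaser"><h2>Storm warning</h2></div>
</div>
"""

parser = TitleCollector()
parser.feed(sample)
parser.close()

titles = [t.strip() for t in parser.titles]
print(titles)
```

Counting <div> depth is what lets the parser know when the "World"
section has ended, which is exactly the kind of bookkeeping that is
awkward to express in a regular expression.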

HTH,

Alan G. 



___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] Titles from a web page

2011-05-04 Thread Michiel Overtoom


On May 5, 2011, at 07:16, James Mills wrote:

> On Thu, May 5, 2011 at 1:52 PM, Modulok wrote:
>> You might look into the third party module, 'BeautifulSoup'. It's designed to
>> help you interrogate markup (even poor markup), extracting nuggets of data
>> based on various criteria.
>
> lxml is also worth looking into; it provides similar functionality.

For especially broken markup you might even consider version 3.0.7a of
BeautifulSoup. The parser in later versions got slightly less forgiving.

Greetings,

-- 
"Control over the use of one's ideas really constitutes control over other 
people's lives; and it is usually used to make their lives more difficult." - 
Richard Stallman


Re: [Tutor] Titles from a web page

2011-05-04 Thread James Mills

On Thu, May 5, 2011 at 1:52 PM, Modulok wrote:
> You might look into the third party module, 'BeautifulSoup'. It's designed to
> help you interrogate markup (even poor markup), extracting nuggets of data
> based on various criteria.

lxml is also worth looking into; it provides similar functionality.


-- 
-- James Mills
--
-- "Problems are solved by method"

Re: [Tutor] Titles from a web page

2011-05-04 Thread Modulok

You might look into the third party module, 'BeautifulSoup'. It's designed to
help you interrogate markup (even poor markup), extracting nuggets of data based
on various criteria.

-Modulok-

On 5/4/11, louis leichtnam wrote:
> Hello Everyone,
>
> I'm trying to write a program that looks in a webpage to find the titles of
> a subsection of the page:
>
> For example you have the list of the titles of stories in a newspaper under
> the section "World" and when you click on one you have the entire story.
>
> I want a program that prints only the titles of this special section of the
> page.
>
> Can you help me out? I tried using regular expressions but I keep hitting
> walls and I don't know what to do...
>
> Thank you
>
