Re: [backstage] Plain text or easy-to-parse news articles

Matthew Somerville Fri, 27 Jul 2007 03:23:07 -0700

Liam S Docherty wrote:

The low graphics version suffers from the same problem, in that the HTML
is not considered well formed by the standard Java parsers.  I suppose I
could try tidying up the HTML before parsing =)

In this situation, I'd always suggest BeautifulSoup, but I'm afraid that's
Python, not Java. But in case it's useful anyway, here's a simple script
that extracts the heading and body of a BBC news story with it:


--8<---------------------------------
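#!/usr/local/bin/python
# Python 2 script; needs BeautifulSoup 3 (the 'BeautifulSoup' module).

from BeautifulSoup import BeautifulSoup
import urllib

# Fetch the story page.
f = urllib.urlopen('http://news.bbc.co.uk/1/hi/uk_politics/6918266.stm')
html = f.read()
f.close()

soup = BeautifulSoup(html)
# Take the second table with width=629.
table = soup.findAll('table', width=629)[1]
heading = table.find('div', {'class':'sh'}) # Perhaps mxb?
# The body is taken from the first cell of the table's second row.
body = table.findAll('tr')[1].find('td')
# Remove the 'mvtb' divs (cruft) from the body before printing.
crufts = body.findAll('div', {'class':'mvtb'})
for cruft in crufts:
    cruft.extract()
print "%s\n\n%s" % (heading.renderContents(), body.renderContents())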
--8<---------------------------------
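
If BeautifulSoup isn't to hand, the same idea (use a parser that tolerates
markup which is not well formed) can be sketched with the html.parser
module from the Python 3 standard library. This is only an illustration:
the ClassTextExtractor class, the extract_heading function and the sample
markup are made up for the example, and it assumes the heading lives in a
div with class "sh", as in the script above. It has not been run against a
real BBC page.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first <div> carrying a given class.

    Hypothetical helper for illustration only.
    """

    def __init__(self, wanted_class):
        super().__init__(convert_charrefs=True)
        self.wanted_class = wanted_class
        self.depth = 0       # > 0 while we are inside the wanted <div>
        self.done = False    # True once the wanted <div> has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div' or self.done:
            return
        if self.depth:
            self.depth += 1  # a nested <div> inside the wanted one
        elif self.wanted_class in (dict(attrs).get('class') or '').split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def text(self):
        # Join the collected pieces and collapse runs of whitespace.
        return ' '.join(''.join(self.chunks).split())


def extract_heading(html, wanted_class='sh'):
    """Return the text of the first <div class=wanted_class>, or ''."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up, deliberately malformed markup: unclosed <b>, <td>, <tr>, <table>.
sample = '<table><tr><td><div class="sh">Example <b>headline</div><p>Body text'
print(extract_heading(sample))
```

html.parser doesn't insist on well-formed input, so the unclosed tags in
the sample don't stop it; it just reports tags and text as it meets them.
Unlike BeautifulSoup it builds no tree, which is why the class has to
track the div nesting by hand.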

ATB,
Matthew
--
http://www.dracos.co.uk/

-
Sent via the backstage.bbc.co.uk discussion group.  To unsubscribe, please 
visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.  
Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/
