On 10 April 2013 09:44, Jabba Laci <jabba.l...@gmail.com> wrote:
> Hi,
>
> I wonder if there is a nice way to extract a whole HTML table and have the
> result in a nice structured format. What I want is to have the lifetime
> table at the bottom of this page:
> http://en.wikipedia.org/wiki/List_of_Ubuntu_releases (then figure out with a
> script until when my Ubuntu release is supported).
>
> I could do it with BeautifulSoup or lxml but is there a better way? There
> should be :)
Instead of parsing HTML, you could just parse the source of the page
(available via action=raw):

------------------------------
import urllib2

url = (
    'http://en.wikipedia.org/w/index.php'
    '?title=List_of_Ubuntu_releases&action=raw'
)
source = urllib2.urlopen(url).read()

# Table rows are separated with the line "|-"
# Then there is a line starting with "|"
potential_rows = source.split("\n|-\n|")

rows = []
for row in potential_rows:
    # Rows in the table start with a link (' [[ ... ]]')
    if row.startswith(" [["):
        row = [item.strip() for item in row.split("\n|")]
        rows.append(row)
------------------------------

>>> import pprint
>>> pprint.pprint(rows)
[['[[Warty Warthog|4.10]]',
  'Warty Warthog',
  '20 October 2004',
  'colspan="2" {{Version |o |30 April 2006}}',
  '2.6.8'],
 ['[[Hoary Hedgehog|5.04]]',
  'Hoary Hedgehog',
  '8 April 2005',
  'colspan="2" {{Version |o |31 October 2006}}',
  '2.6.10'],
 ['[[Breezy Badger|5.10]]',
  'Breezy Badger',
  '13 October 2005',
  'colspan="2" {{Version |o |13 April 2007}}',
  '2.6.12'],
 ['[[Ubuntu 6.06|6.06 LTS]]',
  'Dapper Drake',
  '1 June 2006',
  '{{Version |o | 14 July 2009}}',
  '{{Version |o | 1 June 2011}}',
  '2.6.15'],
 ['[[Ubuntu 6.10|6.10]]',
  'Edgy Eft',
  '26 October 2006',
  'colspan="2" {{Version |o | 25 April 2008}}',
  '2.6.17'],
 [...]
]
>>>

That should give you the info you need (until the wiki page changes too much!)

-- 
Arnaud
-- 
http://mail.python.org/mailman/listinfo/python-list
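
For the "until when is my release supported" part, here is a rough sketch of one way to pull the dates out of rows shaped like the ones printed above. The names VERSION_DATE, support_end_dates and supported_until are made up for this example. It assumes the date is always the last field of a {{Version ...}} template, is written as "D Month YYYY", and that month names parse in an English locale. When a row carries two dates it returns the later one, which is a guess at what you would want.

```python
import re
from datetime import datetime

# Matches the date inside a template such as "{{Version |o |30 April 2006}}".
# Assumption: the date is the last "|"-separated field of the template.
VERSION_DATE = re.compile(r"\{\{Version\s*\|[^|}]*\|\s*([^|}]+?)\s*\}\}")


def support_end_dates(row):
    """Return every date found in the {{Version ...}} cells of one row."""
    dates = []
    for cell in row:
        match = VERSION_DATE.search(cell)
        if match:
            dates.append(datetime.strptime(match.group(1), "%d %B %Y").date())
    return dates


def supported_until(rows, version):
    """Return the latest date listed for a release such as '6.06', or None."""
    for row in rows:
        # The first cell looks like "[[Ubuntu 6.06|6.06 LTS]]";
        # keep the text after the "|" and compare its first word.
        label = row[0].strip("[]").split("|")[-1]
        if label.split()[:1] == [version]:
            dates = support_end_dates(row)
            return max(dates) if dates else None
    return None


# Two rows copied from the pprint output above, so no network access is needed.
sample_rows = [
    ['[[Warty Warthog|4.10]]', 'Warty Warthog', '20 October 2004',
     'colspan="2" {{Version |o |30 April 2006}}', '2.6.8'],
    ['[[Ubuntu 6.06|6.06 LTS]]', 'Dapper Drake', '1 June 2006',
     '{{Version |o | 14 July 2009}}', '{{Version |o | 1 June 2011}}',
     '2.6.15'],
]

print(supported_until(sample_rows, "6.06"))  # prints 2011-06-01
```

Comparing the result with datetime.date.today() then tells you whether the release is still supported. Like the row splitting, this depends on the wiki markup keeping its current shape.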