Re: extract Infobox contents

Rhodri James Mon, 06 Apr 2009 15:42:37 -0700

On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain wrote:

Hi,
I am trying to extract the contents of a Wikipedia Infobox, in the
format shown below, from a page opened by URL in Python.


{{ Infobox Software
| name                   = Bash
| logo                   = [[Image:bash-org.png|165px]]
| screenshot             = [[Image:Bash demo.png|250px]]
| caption                = Screenshot of bash and [[Bourne shell|sh]] sessions demonstrating some features
| developer              = [[Chet Ramey]]
| latest release version = 4.0
| latest release date    = {{release date|mf=yes|2009|02|20}}
| programming language   = [[C (programming language)|C]]
| operating system       = [[Cross-platform]]
| platform               = [[GNU]]
| language               = English, multilingual ([[gettext]])
| status                 = Active
| genre                  = [[Unix shell]]
| source model           = [[Free software]]
| license                = [[GNU General Public License]]
| website                = [http://tiswww.case.edu/php/chet/bash/bashtop.html Home page]
}} // up to this line

I need to extract all the data between {{ Infobox ... and }}

Thanks if anyone can help.
I am trying with

s1='{{ Infobox'
s2=len(s1)
pos1=data.find("{{ Infobox")
pos2=data.find("\n",pos2)

pat1=data.find("}}")

but I end up getting only the one line at the top.
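
Two things in that snippet look likely to cause trouble: pos2 is read
before it has been assigned (presumably pos1 was meant), and
data.find("}}") with no start position returns the first "}}" anywhere
in the text, which may belong to a nested template such as
{{release date|...}}. If the whole page is held in one string, a
minimal sketch of a find-based approach that counts brace depth might
look like this. The names extract_infobox and marker are made up for
illustration, and the code assumes the braces are balanced.

```python
def extract_infobox(data, marker="{{ Infobox"):
    """Return the first infobox in data, braces included, or None."""
    start = data.find(marker)
    if start == -1:
        return None
    depth = 0
    pos = start
    while pos < len(data):
        if data.startswith("{{", pos):
            # Opening braces: the infobox itself or a nested template.
            depth += 1
            pos += 2
        elif data.startswith("}}", pos):
            depth -= 1
            pos += 2
            if depth == 0:
                # This "}}" closes the infobox we started in.
                return data[start:pos]
        else:
            pos += 1
    return None  # unbalanced braces: no closing "}}" found
```

Calling extract_infobox(data) on the page text should then give the
block from "{{ Infobox" through its matching "}}", skipping over any
nested "}}" on the way.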


How are you getting your data?  Assuming that you can arrange to get
it one line at a time, here's a quick and dirty way to extract the
infoboxes on a page.

infoboxes = []
infobox = []
reading_infobox = False

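for line in feed_me_lines_somehow():
    # An infobox starts on a line beginning "{{ Infobox".
    if line.startswith("{{ Infobox"):
        reading_infobox = True
    if reading_infobox:
        infobox.append(line)
        # Only treat "}}" as a terminator while inside an infobox;
        # otherwise a stray "}}" would append an empty list.
        if line.startswith("}}"):
            reading_infobox = False
            infoboxes.append(infobox)
            infobox = []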

You end up with 'infoboxes' containing a list of all the infoboxes
on the page, each held as a list of the lines of their content.
For safety's sake you really should be using regular expressions
rather than 'startswith', but I leave that as an exercise for the
reader :-)
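
For anyone who would rather not do the exercise, here is one possible
sketch of the same loop using the re module. The two patterns are my
guess at what "safer" means (leading whitespace allowed, any spacing
after "{{", either capitalisation of "Infobox"); collect_infoboxes and
the pattern names are invented for this example. It still assumes the
closing "}}" starts a line of its own, so a nested template that ends
at the start of a line would cut the infobox short.

```python
import re

# Assumed patterns, not taken from any Wikipedia specification.
INFOBOX_START = re.compile(r"\s*\{\{\s*[Ii]nfobox\b")
INFOBOX_END = re.compile(r"\s*\}\}")


def collect_infoboxes(lines):
    """Return a list of infoboxes, each a list of its lines."""
    infoboxes = []
    infobox = []
    reading_infobox = False
    for line in lines:
        if not reading_infobox and INFOBOX_START.match(line):
            reading_infobox = True
        if reading_infobox:
            infobox.append(line)
            # Only look for the terminator while inside an infobox.
            if INFOBOX_END.match(line):
                reading_infobox = False
                infoboxes.append(infobox)
                infobox = []
    return infoboxes
```

Any iterable of lines will do as the argument, for example an open
file object or data.splitlines().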

--
Rhodri James *-* Wildebeeste Herder to the Masses
--
http://mail.python.org/mailman/listinfo/python-list
