Re: extract Infobox contents

Rhodri James Tue, 07 Apr 2009 17:58:25 -0700

On Tue, 07 Apr 2009 12:46:18 +0100, J. Clifford Dyer <[email protected]> wrote:

On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:

On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain
<[email protected]> wrote:


> Hi,
> I was trying to extract wikipedia Infobox contents which is in format
> like given below, from the opened URL page in Python.
>
> {{ Infobox Software
> | name                   = Bash

[snip]

> | latest release date    = {{release date|mf=yes|2009|02|20}}
> | programming language   = [[C (programming language)|C]]
> | operating system       = [[Cross-platform]]
> | platform               = [[GNU]]
> | language               = English, multilingual ([[gettext]])
> | status                 = Active

[snip some more]

> }} //upto this line
>
> I need to extract all data between {{ Infobox ...to }}


[snip still more]

You end up with 'infoboxes' containing a list of all the infoboxes
on the page, each held as a list of the lines of their content.
For safety's sake you really should be using regular expressions
rather than 'startswith', but I leave that as an exercise for the
reader :-)


I agree that startswith isn't the right option, but for matching two
constant characters, I don't think re is necessary.  I'd just do:

if '}}' in line:
    pass

Then, as the saying goes, you only have one problem.


That would be the problem of matching lines like:

 | latest release date    = {{release date|mf=yes|2009|02|20}}

would it? :-)

A quick bit of timing suggests that:

  if line.lstrip().startswith("}}"):
    pass

is what we actually want.
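
For anyone following along, here is a rough sketch of why the two tests differ, plus one way the extraction loop might look. It is only an illustration, not the code snipped earlier in the thread: the sample text is cut down from the infobox quoted above, the names closes_by_substring, closes_by_prefix and extract_infoboxes are made up for this example, and it assumes every infobox opens with a line starting "{{ Infobox" and closes with a line that starts with "}}".

```python
# Sample lines modelled on the infobox quoted earlier in this thread.
SAMPLE = """\
Some text before the box.
{{ Infobox Software
| name                   = Bash
| latest release date    = {{release date|mf=yes|2009|02|20}}
| status                 = Active
}}
Some text after the box.
""".splitlines()


def closes_by_substring(line):
    # The "'}}' in line" test: also true for lines holding a nested template.
    return '}}' in line


def closes_by_prefix(line):
    # The lstrip().startswith test: true only when the line begins with }}.
    return line.lstrip().startswith("}}")


def extract_infoboxes(lines):
    """Return a list of infoboxes, each held as a list of its content lines."""
    infoboxes = []
    current = None  # None while we are outside any infobox
    for line in lines:
        if current is None:
            # Assumption: the opening marker is written exactly "{{ Infobox".
            if line.lstrip().startswith("{{ Infobox"):
                current = []
        elif closes_by_prefix(line):
            infoboxes.append(current)
            current = None
        else:
            current.append(line)
    return infoboxes
```

On the "latest release date" line the substring test reports a closing brace pair and the prefix test does not, so only the prefix test keeps the whole box together. A closing "}}" that shares a line with other content would still be missed by this sketch.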

--
Rhodri James *-* Wildebeeste Herder to the Masses
--
http://mail.python.org/mailman/listinfo/python-list
