Re: extract Infobox contents

2009-04-08 Thread J. Cliff Dyer
On Wed, 2009-04-08 at 01:57 +0100, Rhodri James wrote:
 On Tue, 07 Apr 2009 12:46:18 +0100, J. Clifford Dyer  
 j...@sdf.lonestar.org wrote:
 
  On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
  On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain
  anishchapag...@gmail.com wrote:
 
   Hi,
   I was trying to extract Wikipedia Infobox contents, which are in the format
   given below, from a URL page opened in Python.
  
   {{ Infobox Software
   | name   = Bash
 [snip]
   | latest release date= {{release date|mf=yes|2009|02|20}}
   | programming language   = [[C (programming language)|C]]
   | operating system   = [[Cross-platform]]
   | platform   = [[GNU]]
   | language   = English, multilingual ([[gettext]])
   | status = Active
 [snip some more]
   }} //upto this line
  
   I need to extract all data between {{ Infobox ...to }}
 
 [snip still more]
 
  You end up with 'infoboxes' containing a list of all the infoboxes
  on the page, each held as a list of the lines of their content.
  For safety's sake you really should be using regular expressions
  rather than 'startswith', but I leave that as an exercise for the
  reader :-)
 
 
  I agree that startswith isn't the right option, but for matching two
  constant characters, I don't think re is necessary.  I'd just do:
 
  if '}}' in line:
      pass
 
  Then, as the saying goes, you only have one problem.
 
 That would be the problem of matching lines like:
 
   | latest release date= {{release date|mf=yes|2009|02|20}}
 
 would it? :-)
 

That's the one.

 A quick bit of timing suggests that:
 
     if line.lstrip().startswith('}}'):
         pass
 
 is what we actually want.
 

Indeed.  Thanks.
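
For page text held in a single string, as with `data` in the original post, the nested {{release date|...}} template can also be handled by counting brace depth instead of testing line by line. A minimal sketch under that assumption follows; the function name and the `marker` parameter are made up for illustration and are not from anyone's posted code.

```python
def extract_infoboxes_from_text(data, marker="{{ Infobox"):
    """Return each infobox as one string, from its '{{' to the matching '}}'."""
    results = []
    start = data.find(marker)
    while start != -1:
        depth = 0
        i = start
        end = -1
        # Walk forward, tracking how many '{{' are currently open.
        while i < len(data) - 1:
            pair = data[i:i + 2]
            if pair == "{{":
                depth += 1
                i += 2
            elif pair == "}}":
                depth -= 1
                i += 2
                if depth == 0:
                    end = i  # index just past the matching '}}'
                    break
            else:
                i += 1
        if end == -1:
            break  # unbalanced braces: stop rather than guess
        results.append(data[start:end])
        start = data.find(marker, end)
    return results


sample = (
    "intro\n"
    "{{ Infobox Software\n"
    "| name = Bash\n"
    "| latest release date= {{release date|mf=yes|2009|02|20}}\n"
    "}} trailing text\n"
    "{{ Infobox Other\n"
    "| a = b\n"
    "}}\n"
)
found = extract_infoboxes_from_text(sample)
```

Because the inner template opens and closes before the depth returns to zero, it stays inside the extracted infobox instead of ending it early.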

--
http://mail.python.org/mailman/listinfo/python-list


Re: extract Infobox contents

2009-04-07 Thread J. Clifford Dyer
On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
 On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain  
 anishchapag...@gmail.com wrote:
 
  Hi,
  I was trying to extract Wikipedia Infobox contents, which are in the format
  given below, from a URL page opened in Python.
 
  {{ Infobox Software
  | name   = Bash
  | logo   = [[Image:bash-org.png|165px]]
  | screenshot = [[Image:Bash demo.png|250px]]
  | caption= Screenshot of bash and [[Bourne shell|sh]]
  sessions demonstrating some features
  | developer  = [[Chet Ramey]]
  | latest release version = 4.0
  | latest release date= {{release date|mf=yes|2009|02|20}}
  | programming language   = [[C (programming language)|C]]
  | operating system   = [[Cross-platform]]
  | platform   = [[GNU]]
  | language   = English, multilingual ([[gettext]])
  | status = Active
  | genre  = [[Unix shell]]
  | source model   = [[Free software]]
  | license= [[GNU General Public License]]
  | website= [http://tiswww.case.edu/php/chet/bash/
  bashtop.html Home page]
  }} //upto this line
 
  I need to extract all data between {{ Infobox ...to }}
 
  Thanks if anyone can help.
  I am trying with
 
  s1='{{ Infobox'
  s2=len(s1)
  pos1=data.find('{{ Infobox')
  pos2=data.find('\n',pos2)
 
  pat1=data.find('}}')
 
  but I end up getting only the one line at the top.
 
 How are you getting your data?  Assuming that you can arrange to get
 it one line at a time, here's a quick and dirty way to extract the
 infoboxes on a page.
 
 infoboxes = []
 infobox = []
 reading_infobox = False
 
 for line in feed_me_lines_somehow():
     if line.startswith('{{ Infobox'):
         reading_infobox = True
     if reading_infobox:
         infobox.append(line)
         if line.startswith('}}'):
             reading_infobox = False
             infoboxes.append(infobox)
             infobox = []
 
 You end up with 'infoboxes' containing a list of all the infoboxes
 on the page, each held as a list of the lines of their content.
 For safety's sake you really should be using regular expressions
 rather than 'startswith', but I leave that as an exercise for the
 reader :-)
 

I agree that startswith isn't the right option, but for matching two
constant characters, I don't think re is necessary.  I'd just do:

if '}}' in line:
    pass

Then, as the saying goes, you only have one problem.

Cheers,
Cliff




Re: extract Infobox contents

2009-04-07 Thread Rhodri James
On Tue, 07 Apr 2009 12:46:18 +0100, J. Clifford Dyer  
j...@sdf.lonestar.org wrote:



On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:

On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain
anishchapag...@gmail.com wrote:

 Hi,
 I was trying to extract Wikipedia Infobox contents, which are in the format
 given below, from a URL page opened in Python.

 {{ Infobox Software
 | name   = Bash

[snip]

 | latest release date= {{release date|mf=yes|2009|02|20}}
 | programming language   = [[C (programming language)|C]]
 | operating system   = [[Cross-platform]]
 | platform   = [[GNU]]
 | language   = English, multilingual ([[gettext]])
 | status = Active

[snip some more]

 }} //upto this line

 I need to extract all data between {{ Infobox ...to }}


[snip still more]


You end up with 'infoboxes' containing a list of all the infoboxes
on the page, each held as a list of the lines of their content.
For safety's sake you really should be using regular expressions
rather than 'startswith', but I leave that as an exercise for the
reader :-)



I agree that startswith isn't the right option, but for matching two
constant characters, I don't think re is necessary.  I'd just do:

if '}}' in line:
    pass

Then, as the saying goes, you only have one problem.


That would be the problem of matching lines like:

 | latest release date= {{release date|mf=yes|2009|02|20}}

would it? :-)

A quick bit of timing suggests that:

  if line.lstrip().startswith('}}'):
      pass

is what we actually want.
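
To make the difference concrete, here is a small sketch comparing the two tests on lines taken from the infobox quoted above. The helper and variable names are made up for illustration.

```python
# Sample lines taken from the infobox quoted in this thread.
nested_line = "| latest release date= {{release date|mf=yes|2009|02|20}}"
closing_line = "}} //upto this line"
indented_closing_line = "  }}"


def ends_infobox_substring(line):
    """Substring test: true for any line that contains '}}' anywhere."""
    return "}}" in line


def ends_infobox_prefix(line):
    """Prefix test: true only when '}}' is the first non-blank text."""
    return line.lstrip().startswith("}}")


for line in (nested_line, closing_line, indented_closing_line):
    print(repr(line), ends_infobox_substring(line), ends_infobox_prefix(line))
```

The substring test fires on the nested release-date line and would close the infobox early; the prefix test only fires on the two closing lines.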

--
Rhodri James *-* Wildebeeste Herder to the Masses


extract Infobox contents

2009-04-06 Thread Anish Chapagain
Hi,
I was trying to extract Wikipedia Infobox contents, which are in the format
given below, from a URL page opened in Python.

{{ Infobox Software
| name   = Bash
| logo   = [[Image:bash-org.png|165px]]
| screenshot = [[Image:Bash demo.png|250px]]
| caption= Screenshot of bash and [[Bourne shell|sh]]
sessions demonstrating some features
| developer  = [[Chet Ramey]]
| latest release version = 4.0
| latest release date= {{release date|mf=yes|2009|02|20}}
| programming language   = [[C (programming language)|C]]
| operating system   = [[Cross-platform]]
| platform   = [[GNU]]
| language   = English, multilingual ([[gettext]])
| status = Active
| genre  = [[Unix shell]]
| source model   = [[Free software]]
| license= [[GNU General Public License]]
| website= [http://tiswww.case.edu/php/chet/bash/
bashtop.html Home page]
}} //upto this line

I need to extract all data between {{ Infobox ...to }}

Thanks if anyone can help.
I am trying with

s1='{{ Infobox'
s2=len(s1)
pos1=data.find('{{ Infobox')
pos2=data.find('\n',pos2)

pat1=data.find('}}')

but I end up getting only the one line at the top.

thank you,


Re: extract Infobox contents

2009-04-06 Thread Rhodri James
On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain  
anishchapag...@gmail.com wrote:



Hi,
I was trying to extract Wikipedia Infobox contents, which are in the format
given below, from a URL page opened in Python.

{{ Infobox Software
| name   = Bash
| logo   = [[Image:bash-org.png|165px]]
| screenshot = [[Image:Bash demo.png|250px]]
| caption= Screenshot of bash and [[Bourne shell|sh]]
sessions demonstrating some features
| developer  = [[Chet Ramey]]
| latest release version = 4.0
| latest release date= {{release date|mf=yes|2009|02|20}}
| programming language   = [[C (programming language)|C]]
| operating system   = [[Cross-platform]]
| platform   = [[GNU]]
| language   = English, multilingual ([[gettext]])
| status = Active
| genre  = [[Unix shell]]
| source model   = [[Free software]]
| license= [[GNU General Public License]]
| website= [http://tiswww.case.edu/php/chet/bash/
bashtop.html Home page]
}} //upto this line

I need to extract all data between {{ Infobox ...to }}

Thanks if anyone can help.
I am trying with

s1='{{ Infobox'
s2=len(s1)
pos1=data.find('{{ Infobox')
pos2=data.find('\n',pos2)

pat1=data.find('}}')

but I end up getting only the one line at the top.


How are you getting your data?  Assuming that you can arrange to get
it one line at a time, here's a quick and dirty way to extract the
infoboxes on a page.

infoboxes = []
infobox = []
reading_infobox = False

for line in feed_me_lines_somehow():
    if line.startswith('{{ Infobox'):
        reading_infobox = True
    if reading_infobox:
        infobox.append(line)
        if line.startswith('}}'):
            reading_infobox = False
            infoboxes.append(infobox)
            infobox = []

You end up with 'infoboxes' containing a list of all the infoboxes
on the page, each held as a list of the lines of their content.
For safety's sake you really should be using regular expressions
rather than 'startswith', but I leave that as an exercise for the
reader :-)
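
One possible take on that exercise, as a sketch rather than a definitive answer: the same loop with compiled patterns that tolerate leading whitespace and variable spacing after the opening braces. The pattern names, the function name, and the sample text are assumptions made for illustration.

```python
import re

# Assumed patterns: optional leading whitespace, flexible spacing after '{{'.
INFOBOX_START = re.compile(r"\s*\{\{\s*Infobox\b")
INFOBOX_END = re.compile(r"\s*\}\}")


def extract_infoboxes(lines):
    """Collect each infobox as a list of its lines."""
    infoboxes = []
    infobox = []
    reading_infobox = False
    for line in lines:
        # match() anchors at the start of the line, like startswith().
        if INFOBOX_START.match(line):
            reading_infobox = True
        if reading_infobox:
            infobox.append(line)
            if INFOBOX_END.match(line):
                reading_infobox = False
                infoboxes.append(infobox)
                infobox = []
    return infoboxes


sample = """Some intro text
{{ Infobox Software
| name   = Bash
| latest release date= {{release date|mf=yes|2009|02|20}}
| status = Active
}}
More article text
"""
boxes = extract_infoboxes(sample.splitlines())
```

This still works line by line, so an infobox written entirely on one line, or a nested template whose closing braces sit on a line of their own, would not be handled correctly.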

--
Rhodri James *-* Wildebeeste Herder to the Masses