Re: extract Infobox contents
On Wed, 2009-04-08 at 01:57 +0100, Rhodri James wrote:
> On Tue, 07 Apr 2009 12:46:18 +0100, J. Clifford Dyer j...@sdf.lonestar.org wrote:
>> On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
>>> On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain anishchapag...@gmail.com wrote:
>>>> Hi,
>>>> I was trying to extract wikipedia Infobox contents which is in format
>>>> like given below, from the opened URL page in Python.
>>>>
>>>> {{ Infobox Software
>>>> | name = Bash
> [snip]
>>>> | latest release date= {{release date|mf=yes|2009|02|20}}
>>>> | programming language = [[C (programming language)|C]]
>>>> | operating system = [[Cross-platform]]
>>>> | platform = [[GNU]]
>>>> | language = English, multilingual ([[gettext]])
>>>> | status = Active
> [snip some more]
>>>> }} //up to this line
>>>>
>>>> I need to extract all data between {{ Infobox ... to }}
> [snip still more]
>>> You end up with 'infoboxes' containing a list of all the infoboxes
>>> on the page, each held as a list of the lines of their content.
>>> For safety's sake you really should be using regular expressions
>>> rather than 'startswith', but I leave that as an exercise for the
>>> reader :-)
>>
>> I agree that startswith isn't the right option, but for matching two
>> constant characters, I don't think re is necessary. I'd just do:
>>
>>     if '}}' in line:
>>         pass
>>
>> Then, as the saying goes, you only have one problem.
>
> That would be the problem of matching lines like:
>
> | latest release date= {{release date|mf=yes|2009|02|20}}
>
> would it? :-)

That's the one.

> A quick bit of timing suggests that:
>
>     if line.lstrip().startswith("}}"):
>         pass
>
> is what we actually want.

Indeed. Thanks.

--
http://mail.python.org/mailman/listinfo/python-list
Re: extract Infobox contents
On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
> On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain anishchapag...@gmail.com wrote:
>> Hi,
>> I was trying to extract wikipedia Infobox contents which is in format
>> like given below, from the opened URL page in Python.
>>
>> {{ Infobox Software
>> | name = Bash
>> | logo = [[Image:bash-org.png|165px]]
>> | screenshot = [[Image:Bash demo.png|250px]]
>> | caption= Screenshot of bash and [[Bourne shell|sh]] sessions demonstrating some features
>> | developer = [[Chet Ramey]]
>> | latest release version = 4.0
>> | latest release date= {{release date|mf=yes|2009|02|20}}
>> | programming language = [[C (programming language)|C]]
>> | operating system = [[Cross-platform]]
>> | platform = [[GNU]]
>> | language = English, multilingual ([[gettext]])
>> | status = Active
>> | genre = [[Unix shell]]
>> | source model = [[Free software]]
>> | license= [[GNU General Public License]]
>> | website= [http://tiswww.case.edu/php/chet/bash/ bashtop.html Home page]
>> }} //up to this line
>>
>> I need to extract all data between {{ Infobox ... to }}
>>
>> Thanks if anyone can help, am trying with
>>
>> s1='{{ Infobox'
>> s2=len(s1)
>> pos1=data.find("{{ Infobox")
>> pos2=data.find("\n",pos2)
>> pat1=data.find("}}")
>>
>> but am ending up getting one line at top only.
>
> How are you getting your data? Assuming that you can arrange to get
> it one line at a time, here's a quick and dirty way to extract the
> infoboxes on a page.
>
> infoboxes = []
> infobox = []
> reading_infobox = False
>
> for line in feed_me_lines_somehow():
>     if line.startswith("{{ Infobox"):
>         reading_infobox = True
>     if reading_infobox:
>         infobox.append(line)
>         if line.startswith("}}"):
>             reading_infobox = False
>             infoboxes.append(infobox)
>             infobox = []
>
> You end up with 'infoboxes' containing a list of all the infoboxes
> on the page, each held as a list of the lines of their content.
>
> For safety's sake you really should be using regular expressions
> rather than 'startswith', but I leave that as an exercise for the
> reader :-)

I agree that startswith isn't the right option, but for matching two
constant characters, I don't think re is necessary. I'd just do:

    if '}}' in line:
        pass

Then, as the saying goes, you only have one problem.

Cheers,
Cliff
Re: extract Infobox contents
On Tue, 07 Apr 2009 12:46:18 +0100, J. Clifford Dyer j...@sdf.lonestar.org wrote:
> On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
>> On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain anishchapag...@gmail.com wrote:
>>> Hi,
>>> I was trying to extract wikipedia Infobox contents which is in format
>>> like given below, from the opened URL page in Python.
>>>
>>> {{ Infobox Software
>>> | name = Bash
[snip]
>>> | latest release date= {{release date|mf=yes|2009|02|20}}
>>> | programming language = [[C (programming language)|C]]
>>> | operating system = [[Cross-platform]]
>>> | platform = [[GNU]]
>>> | language = English, multilingual ([[gettext]])
>>> | status = Active
[snip some more]
>>> }} //up to this line
>>>
>>> I need to extract all data between {{ Infobox ... to }}
[snip still more]
>> You end up with 'infoboxes' containing a list of all the infoboxes
>> on the page, each held as a list of the lines of their content.
>> For safety's sake you really should be using regular expressions
>> rather than 'startswith', but I leave that as an exercise for the
>> reader :-)
>
> I agree that startswith isn't the right option, but for matching two
> constant characters, I don't think re is necessary. I'd just do:
>
>     if '}}' in line:
>         pass
>
> Then, as the saying goes, you only have one problem.

That would be the problem of matching lines like:

| latest release date= {{release date|mf=yes|2009|02|20}}

would it? :-)

A quick bit of timing suggests that:

    if line.lstrip().startswith("}}"):
        pass

is what we actually want.

--
Rhodri James *-* Wildebeeste Herder to the Masses
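For anyone following along, here is a small sketch that compares the three closing tests discussed in this thread on two lines. The field line is taken from the posted infobox; the indented closing line is an assumed case, chosen only to show where a plain startswith would miss. The helper names are made up for illustration.

```python
# Line from the posted infobox: contains a nested template that ends in "}}".
field_line = "| latest release date= {{release date|mf=yes|2009|02|20}}"
# Assumed case: the infobox's closing braces preceded by whitespace.
closing_line = "  }}"


def closes_by_substring(line):
    """The "'}}' in line" test."""
    return "}}" in line


def closes_by_startswith(line):
    """The plain startswith test."""
    return line.startswith("}}")


def closes_by_lstrip(line):
    """The lstrip().startswith test."""
    return line.lstrip().startswith("}}")


for name, line in (("field", field_line), ("closing", closing_line)):
    print(name,
          closes_by_substring(line),
          closes_by_startswith(line),
          closes_by_lstrip(line))
```

The substring test fires on the field line as well as the closing line, the plain startswith test misses the indented closing line, and only the lstrip variant separates the two. None of the three handles a nested template whose own closing braces start a line; that would need brace counting.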
extract Infobox contents
Hi,
I was trying to extract wikipedia Infobox contents which is in format
like given below, from the opened URL page in Python.

{{ Infobox Software
| name = Bash
| logo = [[Image:bash-org.png|165px]]
| screenshot = [[Image:Bash demo.png|250px]]
| caption= Screenshot of bash and [[Bourne shell|sh]] sessions demonstrating some features
| developer = [[Chet Ramey]]
| latest release version = 4.0
| latest release date= {{release date|mf=yes|2009|02|20}}
| programming language = [[C (programming language)|C]]
| operating system = [[Cross-platform]]
| platform = [[GNU]]
| language = English, multilingual ([[gettext]])
| status = Active
| genre = [[Unix shell]]
| source model = [[Free software]]
| license= [[GNU General Public License]]
| website= [http://tiswww.case.edu/php/chet/bash/ bashtop.html Home page]
}} //up to this line

I need to extract all data between {{ Infobox ... to }}

Thanks if anyone can help, am trying with

s1='{{ Infobox'
s2=len(s1)
pos1=data.find("{{ Infobox")
pos2=data.find("\n",pos2)
pat1=data.find("}}")

but am ending up getting one line at top only.

thank you,
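Two things in the snippet as posted look likely to get in the way: pos2 is passed to find() before it has been assigned, and the first "}}" in the text may belong to a nested template such as {{release date|...}} rather than to the infobox itself. Below is a minimal sketch of a find-based approach that counts brace depth instead. It is not from the thread; the function name extract_infobox and the sample text are made up for illustration, and it assumes the braces in the page are balanced.

```python
def extract_infobox(data, start_marker="{{ Infobox"):
    """Return the text of the first infobox in data, or None.

    Hypothetical helper: scans from the start marker and tracks how many
    "{{" are still open, so nested templates do not end the box early.
    """
    start = data.find(start_marker)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(data) - 1:
        pair = data[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Matching close for the opening "{{ Infobox".
                return data[start:i]
        else:
            i += 1
    return None  # ran off the end: braces were not balanced


# Made-up sample page text, shortened from the infobox in the post.
sample = (
    "Intro text.\n"
    "{{ Infobox Software\n"
    "| name = Bash\n"
    "| latest release date= {{release date|mf=yes|2009|02|20}}\n"
    "| status = Active\n"
    "}}\n"
    "Article body.\n"
)

box = extract_infobox(sample)
print(box)
```

Because the scan works on the whole string, it does not depend on where line breaks fall or on how the closing braces are indented.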
Re: extract Infobox contents
On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain anishchapag...@gmail.com wrote:
> Hi,
> I was trying to extract wikipedia Infobox contents which is in format
> like given below, from the opened URL page in Python.
>
> {{ Infobox Software
> | name = Bash
> | logo = [[Image:bash-org.png|165px]]
> | screenshot = [[Image:Bash demo.png|250px]]
> | caption= Screenshot of bash and [[Bourne shell|sh]] sessions demonstrating some features
> | developer = [[Chet Ramey]]
> | latest release version = 4.0
> | latest release date= {{release date|mf=yes|2009|02|20}}
> | programming language = [[C (programming language)|C]]
> | operating system = [[Cross-platform]]
> | platform = [[GNU]]
> | language = English, multilingual ([[gettext]])
> | status = Active
> | genre = [[Unix shell]]
> | source model = [[Free software]]
> | license= [[GNU General Public License]]
> | website= [http://tiswww.case.edu/php/chet/bash/ bashtop.html Home page]
> }} //up to this line
>
> I need to extract all data between {{ Infobox ... to }}
>
> Thanks if anyone can help, am trying with
>
> s1='{{ Infobox'
> s2=len(s1)
> pos1=data.find("{{ Infobox")
> pos2=data.find("\n",pos2)
> pat1=data.find("}}")
>
> but am ending up getting one line at top only.

How are you getting your data? Assuming that you can arrange to get
it one line at a time, here's a quick and dirty way to extract the
infoboxes on a page.

infoboxes = []
infobox = []
reading_infobox = False

for line in feed_me_lines_somehow():
    if line.startswith("{{ Infobox"):
        reading_infobox = True
    if reading_infobox:
        infobox.append(line)
        if line.startswith("}}"):
            reading_infobox = False
            infoboxes.append(infobox)
            infobox = []

You end up with 'infoboxes' containing a list of all the infoboxes
on the page, each held as a list of the lines of their content.

For safety's sake you really should be using regular expressions
rather than 'startswith', but I leave that as an exercise for the
reader :-)

--
Rhodri James *-* Wildebeeste Herder to the Masses
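A self-contained sketch of the loop above, for anyone who wants to run it. The function name extract_infoboxes and the sample page text are made up for illustration, and feed_me_lines_somehow() is replaced by whatever iterable of lines you pass in. It keeps the plain startswith tests from the message, so it assumes the closing }} sits at the very start of its own line.

```python
def extract_infoboxes(lines):
    """Collect each infobox as a list of its lines (hypothetical helper)."""
    infoboxes = []
    infobox = []
    reading_infobox = False
    for line in lines:
        if line.startswith("{{ Infobox"):
            reading_infobox = True
        if reading_infobox:
            infobox.append(line)
            # A line starting with "}}" closes the current infobox.
            if line.startswith("}}"):
                reading_infobox = False
                infoboxes.append(infobox)
                infobox = []
    return infoboxes


# Made-up sample page text, shortened from the infobox in the post.
SAMPLE = """Some text before the box.
{{ Infobox Software
| name = Bash
| latest release date= {{release date|mf=yes|2009|02|20}}
| status = Active
}}
Some text after the box.
"""

boxes = extract_infoboxes(SAMPLE.splitlines())
print(len(boxes))
for line in boxes[0]:
    print(line)
```

The nested {{release date|...}} template does not end the box here because its closing braces are not at the start of a line.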