Re: [Wikitech-l] Datamining infoboxes

2009-10-26 Thread Jona Christopher Sahnwaldt
On Mon, Oct 26, 2009 at 02:55, Andrew Dunbar wrote:
> Have you thought about doing the same for Wiktionary?

Interesting idea. I don't know much about Wiktionary. Are its pages structured similarly? How difficult would it be to extract structured data from them? What kind of data would you expect

Re: [Wikitech-l] Datamining infoboxes

2009-10-25 Thread Andrew Dunbar
2009/10/23 Aryeh Gregor:
> On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar wrote:
>> Yes I didn't specify tl_namespace
>
> In MySQL that will usually make it impossible to effectively use an
> index on (tl_namespace, tl_title), so it's essential that you specify
> the NS. (Which you should anywa

Re: [Wikitech-l] Datamining infoboxes

2009-10-25 Thread Andrew Dunbar
2009/10/23 Jona Christopher Sahnwaldt:
> Because of result count restrictions, these queries don't
> return all ISO language codes extracted by DBpedia,
> but I think they give a good impression of the data quality
> and coverage (or sometimes lack thereof):
>
> http://dbpedia.org/sparql?query=sel

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Aryeh Gregor
On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar wrote:
> Yes I didn't specify tl_namespace

In MySQL that will usually make it impossible to effectively use an index on (tl_namespace, tl_title), so it's essential that you specify the NS. (Which you should anyway to avoid hitting things like [[Tem
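The point about the (tl_namespace, tl_title) index can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than MySQL, so it only demonstrates the shape of the query, not MySQL's query planner. The helper names, the index name, and the sample rows are made up for illustration; the column names and namespace 10 for "Template:" follow MediaWiki conventions.

```python
import sqlite3

TEMPLATE_NAMESPACE = 10  # MediaWiki's namespace number for "Template:"


def pages_using_template(conn, title, namespace=TEMPLATE_NAMESPACE):
    """Return the page ids (tl_from) that link to the given template.

    Both leading index columns (tl_namespace, tl_title) appear in the
    WHERE clause, which is what lets a composite index be used.
    """
    rows = conn.execute(
        "SELECT tl_from FROM templatelinks "
        "WHERE tl_namespace = ? AND tl_title = ? "
        "ORDER BY tl_from",
        (namespace, title),
    )
    return [row[0] for row in rows]


def demo_connection():
    """Build an in-memory stand-in for templatelinks (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE templatelinks "
        "(tl_from INTEGER, tl_namespace INTEGER, tl_title TEXT)"
    )
    conn.execute(
        "CREATE INDEX tl_ns_title "
        "ON templatelinks (tl_namespace, tl_title, tl_from)"
    )
    conn.executemany(
        "INSERT INTO templatelinks VALUES (?, ?, ?)",
        [
            (1, 10, "Infobox_Language"),
            (2, 10, "Infobox_Language"),
            (3, 0, "Infobox_Language"),  # same title, different namespace
            (4, 10, "Infobox_Country"),
        ],
    )
    return conn
```

Calling pages_using_template(demo_connection(), "Infobox_Language") returns only the rows in the template namespace; the same-titled row in namespace 0 is excluded.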

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Robert Rohde
On Fri, Oct 23, 2009 at 7:04 AM, Roan Kattouw wrote:
> 2009/10/23 Robert Rohde:
>> Given the fairly obvious utility for data mining, it might make sense
>> for someone to extend the Mediawiki API to generate a list of template
>> calls and the parameters sent in each case.
>>
> We had a discussio

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Jona Christopher Sahnwaldt
Note: the trailing "}" is part of the URL. Some mail readers may cut it off.

On Fri, Oct 23, 2009 at 18:45, Jona Christopher Sahnwaldt wrote:
> Because of result count restrictions, these queries don't
> return all ISO language codes extracted by DBpedia,
> but I think they give a good impression
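One way to sidestep the trailing "}" problem is to percent-encode the whole query, so no literal braces remain in the URL. The following is a minimal Python sketch of that idea. The endpoint is the one used in this thread, but the query text is a placeholder and not the query from the original message, and sparql_url is a hypothetical helper name.

```python
from urllib.parse import quote_plus

ENDPOINT = "http://dbpedia.org/sparql"

# Placeholder query for illustration only; not the query from the thread.
PLACEHOLDER_QUERY = "select distinct ?s ?o where { ?s ?p ?o }"


def sparql_url(query, endpoint=ENDPOINT):
    """Build a GET URL with the query fully percent-encoded.

    quote_plus turns "{" and "}" into %7B and %7D, so a mail reader
    has no trailing brace to cut off.
    """
    return endpoint + "?query=" + quote_plus(query)
```

With this encoding the URL ends in %7D instead of a bare "}".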

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Jona Christopher Sahnwaldt
Because of result count restrictions, these queries don't return all ISO language codes extracted by DBpedia, but I think they give a good impression of the data quality and coverage (or sometimes lack thereof): http://dbpedia.org/sparql?query=select+distinct+%3Fs%2C+%3Fo+where{%3Fs+%3Chttp%3A%2F%

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Andrew Dunbar
2009/10/23 Aryeh Gregor:
> On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar wrote:
>> Yes I found how to get it through the API now. It was actually just
>> the Toolserver database that was intractably slow.
>
> There's nothing slow about the TS database here:
>
> mysql> pager true
> PAGER set to '

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread David Gerard
2009/10/23 William Pietri:
> George Herbert wrote:
>> This discussion brings to mind several historical threads.
>> I wonder if a project to simply mine the whole article contents and
>> provide a DB of some sort with the articles and infobox contents would
>> be worthwhile. Develop a specific p

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Daniel Schwen
Fascinating! It seems to be a repeating pattern on these mailing lists that people ignore existing solutions and discuss re-inventing wheels (please correct me if I'm wrong here). While I agree this is fun some it rarely helps the OP...

[[User:Dschwen]]

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Aryeh Gregor
On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar wrote:
> Yes I found how to get it through the API now. It was actually just
> the Toolserver database that was intractably slow.

There's nothing slow about the TS database here:

mysql> pager true
PAGER set to 'true'
mysql> SELECT tl_from FROM templ

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Neil Harris
Robert Ullmann wrote:
>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>>
>
> Hi,

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Roan Kattouw
2009/10/23 Robert Rohde:
> Given the fairly obvious utility for data mining, it might make sense
> for someone to extend the Mediawiki API to generate a list of template
> calls and the parameters sent in each case.
>

We had a discussion about this Tuesday in the tech staff meeting, and decided th

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Andrew Dunbar
2009/10/23 Robert Ullmann:
>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>
> Hi,
>

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Robert Ullmann
> I've been spending hours on the parsing now and don't find it simple > at all due to the fact that templates can be nested. Just extracting > the Infobox as one big lump is hard due to the need to match nested {{ > and }} > > Andrew Dunbar (hippietrail) Hi, Come now, you are over-thinking it. F

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Magnus Manske
I am so glad that someone re-re-resurrects this topic :-)

On Fri, Oct 23, 2009 at 1:27 PM, Andrew Dunbar wrote:
> I've been spending hours on the parsing now and don't find it simple
> at all due to the fact that templates can be nested. Just extracting
> the Infobox as one big lump is hard due
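The nested {{ and }} matching problem described in the quoted text can be handled with a depth counter. The sketch below is one possible approach in Python, not necessarily what anyone in this thread proposed. find_template is a hypothetical helper; it matches the template name naively (exact prefix, case-sensitive) and ignores complications such as nowiki sections and HTML comments.

```python
def find_template(wikitext, name):
    """Return the full '{{name ...}}' call, nested templates included.

    Scans forward from the first occurrence of '{{' + name, counting
    '{{' as +1 and '}}' as -1, and stops when the depth returns to 0.
    Returns None if the template is absent or its braces never balance.
    """
    start = wikitext.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None
```

Given text such as "{{Infobox Language|fam1={{lang|en|X}}|iso3=foo}}", the inner "}}" only brings the depth back to 1, so the whole outer call comes back as one lump.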

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Robert Rohde
Given the fairly obvious utility for data mining, it might make sense for someone to extend the Mediawiki API to generate a list of template calls and the parameters sent in each case.

-Robert Rohde

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Andrew Dunbar
2009/10/23 Robert Ullmann:
> Hi Hippietrail!
>
> What do you mean by "intractably slow"? Just how fast must it be?
>
> If I do
> http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
> it says (on one given try) that it was serve

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Daniel Schwen
> I wonder if a project to simply mine the whole article contents and > provide a DB of some sort with the articles and infobox contents would > be worthwhile.  Develop a specific parser and generate and publish the > complete set of article-infobox-(key-value) sets... That is a brilliant idea...

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Robert Ullmann
Hi Hippietrail!

What do you mean by "intractably slow"? Just how fast must it be?

If I do
http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
it says (on one given try) that it was served in 0,047 seconds. How long can it take
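For scripted use, the same embeddedin request can be assembled from its parameters. This Python sketch builds the URL and reads page titles out of a decoded JSON response; it makes no network call. The format=json parameter is an addition not present in the URL above, and both function names are hypothetical.

```python
from urllib.parse import urlencode

API = "http://en.wikipedia.org/w/api.php"


def embeddedin_url(template, limit=100, namespace=0, api=API):
    """Build a list=embeddedin query URL for one template."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "eilimit": limit,
        "einamespace": namespace,
        "format": "json",  # assumption: added so the reply is JSON
    }
    return api + "?" + urlencode(params)


def titles_from_response(data):
    """Extract page titles from a decoded JSON reply, if any are present."""
    return [page["title"] for page in data.get("query", {}).get("embeddedin", [])]
```

A single request returns at most eilimit pages, so listing every page that uses the template means following the API's continuation mechanism, which this sketch leaves out.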

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Jona Christopher Sahnwaldt
On Fri, Oct 23, 2009 at 08:37, George Herbert wrote:
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-inf

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Roan Kattouw
2009/10/23 Andrew Dunbar:
> But my attempts to find such pages using either the Toolserver's
> Wikipedia database or the Mediawiki API have not been fruitful. In
> particular, SQL queries on the templatelinks table are intractably
> slow. Why are there no keys on tl_from or tl_title?
>

There are:

Re: [Wikitech-l] Datamining infoboxes

2009-10-22 Thread William Pietri
George Herbert wrote:
> This discussion brings to mind several historical threads.
>
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish th

Re: [Wikitech-l] Datamining infoboxes

2009-10-22 Thread George Herbert
This discussion brings to mind several historical threads. I wonder if a project to simply mine the whole article contents and provide a DB of some sort with the articles and infobox contents would be worthwhile. Develop a specific parser and generate and publish the complete set of article-infob

Re: [Wikitech-l] Datamining infoboxes

2009-10-22 Thread Andrew Dunbar
2009/10/22 Daniel Schwen:
>> particular, SQL queries on the templatelinks table are intractably
>> slow. Why are there no keys on tl_from or tl_title?
>
> How are you planning to get the template parameters? Have I missed a
> recent schema change?

I've been trying to parse the wikitext of section

Re: [Wikitech-l] Datamining infoboxes

2009-10-22 Thread Daniel Schwen
> particular, SQL queries on the templatelinks table are intractably > slow. Why are there no keys on tl_from or tl_title? How are you planning to get the template parameters? Have I missed a recent schema change? I'd be interested in following your progress. I'm not extracting infobox data, but p

[Wikitech-l] Datamining infoboxes

2009-10-22 Thread Andrew Dunbar
Infoboxes in Wikipedia often contain information which is quite useful outside Wikipedia but can be surprisingly difficult to data-mine.

I would like to find all Wikipedia pages that use Template:Infobox_Language and parse the parameters iso3 and fam1...fam15

But my attempts to find such pages us
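Once an infobox call has been isolated as a string, pulling out iso3 and fam1...fam15 mostly comes down to splitting on "|" only at the top level. The Python sketch below is one way to do that. All function names are hypothetical, the sample values in the usage note are made up, and real wikitext has further complications (comments, nowiki, positional parameters) that it ignores.

```python
def split_top_level(body, sep="|"):
    """Split on sep, skipping separators inside nested {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == sep and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def infobox_params(call):
    """Map the named parameters of one '{{Name|key=value|...}}' call."""
    inner = call.strip()[2:-2]  # drop the outer {{ and }}
    params = {}
    for part in split_top_level(inner)[1:]:  # first part is the template name
        key, eq, value = part.partition("=")
        if eq:  # skip positional parameters
            params[key.strip()] = value.strip()
    return params


def language_fields(call):
    """Keep only iso3 and fam1 through fam15, where present."""
    params = infobox_params(call)
    wanted = ["iso3"] + ["fam%d" % n for n in range(1, 16)]
    return {key: params[key] for key in wanted if key in params}
```

For a call like "{{Infobox Language|name=Foo|fam1=[[A|B]]|iso3=foo}}", language_fields keeps fam1 and iso3 and drops name; the "|" inside the link does not split the fam1 value.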