Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Aryeh Gregor
On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar wrote:
> Yes I didn't specify tl_namespace

In MySQL that will usually make it impossible to effectively use an index on (tl_namespace, tl_title), so it's essential that you specify the NS. (Which you should anyway to avoid hitting things like [[Tem
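To illustrate the point about always constraining the namespace, here is a minimal sketch. It uses an in-memory SQLite table as a stand-in for templatelinks purely so the example runs anywhere; the thread is about MySQL, and the two engines plan index use differently. The column names come from the thread, while the index name, the helper names and the rows are made up for illustration. Namespace 10 is MediaWiki's Template namespace.

```python
import sqlite3

TEMPLATE_NS = 10  # MediaWiki's Template: namespace


def demo_connection():
    """Build a tiny stand-in for templatelinks (rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE templatelinks "
        "(tl_from INTEGER, tl_namespace INTEGER, tl_title TEXT)"
    )
    # Hypothetical index name; the column order mirrors (tl_namespace, tl_title).
    conn.execute(
        "CREATE INDEX demo_ns_title ON templatelinks (tl_namespace, tl_title)"
    )
    conn.executemany(
        "INSERT INTO templatelinks VALUES (?, ?, ?)",
        [
            (1, 10, "Infobox_Language"),
            (2, 10, "Infobox_Language"),
            (3, 0, "Infobox_Language"),  # same title, main namespace
            (4, 10, "Infobox_Country"),
        ],
    )
    return conn


def pages_embedding(conn, title, namespace=TEMPLATE_NS):
    """Return page IDs linked to (namespace, title), sorted by ID."""
    # Both columns are constrained, so a composite index can be
    # entered from its leading column (tl_namespace).
    rows = conn.execute(
        "SELECT tl_from FROM templatelinks "
        "WHERE tl_namespace = ? AND tl_title = ? ORDER BY tl_from",
        (namespace, title),
    )
    return [row[0] for row in rows]
```

Calling `pages_embedding(demo_connection(), "Infobox_Language")` returns only the rows in the Template namespace, which also shows the second benefit mentioned above: a page with the same title in another namespace is not picked up by accident.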

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Robert Rohde
On Fri, Oct 23, 2009 at 7:04 AM, Roan Kattouw wrote: > 2009/10/23 Robert Rohde : >> Given the fairly obvious utility for data mining, it might make sense >> for someone to extend the Mediawiki API to generate a list of template >> calls and the parameters sent in each case. >> > We had a discussio

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Jona Christopher Sahnwaldt
Note: the trailing "}" is part of the URL. Some mail readers may cut it off. On Fri, Oct 23, 2009 at 18:45, Jona Christopher Sahnwaldt wrote: > Because of result count restrictions, these queries don't > return all ISO language codes extracted by DBpedia, > but I think they give a good impression

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Jona Christopher Sahnwaldt
Because of result count restrictions, these queries don't return all ISO language codes extracted by DBpedia, but I think they give a good impression of the data quality and coverage (or sometimes lack thereof): http://dbpedia.org/sparql?query=select+distinct+%3Fs%2C+%3Fo+where{%3Fs+%3Chttp%3A%2F%

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Andrew Dunbar
2009/10/23 Aryeh Gregor : > On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar wrote: >> Yes I found how to get it through the API now. It was actually just >> the Toolserver database that was intractably slow. > > There's nothing slow about the TS database here: > > mysql> pager true > PAGER set to '

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread David Gerard
2009/10/23 William Pietri : > George Herbert wrote: >> This discussion brings to mind several historical threads. >> I wonder if a project to simply mine the whole article contents and >> provide a DB of some sort with the articles and infobox contents would >> be worthwhile.  Develop a specific p

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Daniel Schwen
Fascinating! It seems to be a repeating pattern on these mailing lists that people ignore existing solutions and discuss re-inventing wheels (please correct me if I'm wrong here). While I agree this is fun, it rarely helps the OP... [[User:Dschwen]]

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Aryeh Gregor
On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar wrote:
> Yes I found how to get it through the API now. It was actually just
> the Toolserver database that was intractably slow.

There's nothing slow about the TS database here:

mysql> pager true
PAGER set to 'true'
mysql> SELECT tl_from FROM templ

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Neil Harris
Robert Ullmann wrote: >> I've been spending hours on the parsing now and don't find it simple >> at all due to the fact that templates can be nested. Just extracting >> the Infobox as one big lump is hard due to the need to match nested {{ >> and }} >> >> Andrew Dunbar (hippietrail) >> > > Hi,

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Roan Kattouw
2009/10/23 Robert Rohde : > Given the fairly obvious utility for data mining, it might make sense > for someone to extend the Mediawiki API to generate a list of template > calls and the parameters sent in each case. > We had a discussion about this Tuesday in the tech staff meeting, and decided th

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Andrew Dunbar
2009/10/23 Robert Ullmann : >> I've been spending hours on the parsing now and don't find it simple >> at all due to the fact that templates can be nested. Just extracting >> the Infobox as one big lump is hard due to the need to match nested {{ >> and }} >> >> Andrew Dunbar (hippietrail) > > Hi, >

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Robert Ullmann
> I've been spending hours on the parsing now and don't find it simple
> at all due to the fact that templates can be nested. Just extracting
> the Infobox as one big lump is hard due to the need to match nested {{
> and }}
>
> Andrew Dunbar (hippietrail)

Hi,

Come now, you are over-thinking it. F
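The nested {{ and }} matching discussed here can be handled with a simple depth counter. The following is a minimal sketch, not anyone's method from the thread: the function name `extract_template` is hypothetical, the match on the template name is case-sensitive, and it deliberately ignores complications such as `<nowiki>` sections, HTML comments and triple-brace parameters.

```python
def extract_template(wikitext, name):
    """Return the first {{name ...}} call as one string, or None.

    Nested templates are kept intact by counting brace pairs.
    """
    start = wikitext.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1          # entering a (possibly nested) template
            i += 2
        elif pair == "}}":
            depth -= 1          # leaving a template
            i += 2
            if depth == 0:      # closed the template we started with
                return wikitext[start:i]
        else:
            i += 1
    return None  # braces never balanced
```

For example, on `"Intro {{Infobox Language|name=Foo|iso3={{lc:FOO}}}} rest"` the function returns the whole infobox call, inner `{{lc:FOO}}` included, and returns `None` when the closing braces are missing.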

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Magnus Manske
I am so glad that someone re-re-resurrects this topic :-) On Fri, Oct 23, 2009 at 1:27 PM, Andrew Dunbar wrote: > I've been spending hours on the parsing now and don't find it simple > at all due to the fact that templates can be nested. Just extracting > the Infobox as one big lump is hard due

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Robert Rohde
Given the fairly obvious utility for data mining, it might make sense for someone to extend the Mediawiki API to generate a list of template calls and the parameters sent in each case. -Robert Rohde

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Andrew Dunbar
2009/10/23 Robert Ullmann : > Hi Hippietrail! > > What do you mean by "intractably slow"? Just how fast must it be? > > If I do > http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0 > it says (on one given try) that it was serve

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Daniel Schwen
> I wonder if a project to simply mine the whole article contents and > provide a DB of some sort with the articles and infobox contents would > be worthwhile.  Develop a specific parser and generate and publish the > complete set of article-infobox-(key-value) sets... That is a brilliant idea...
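As a rough illustration of what "article-infobox-(key-value) sets" would involve, here is a minimal sketch that splits one already-extracted template call into its name and parameters. The name `split_params` is hypothetical. It assumes the input starts with `{{` and ends with `}}`, splits only on pipes that are outside nested `{{...}}` and `[[...]]`, and treats any part containing `=` as a named parameter, which is wrong for a positional parameter whose value itself contains `=`.

```python
def split_params(template):
    """Split '{{Name|k=v|...}}' into (name, {key: value})."""
    body = template[2:-2]  # drop the outer {{ and }}
    parts, current = [], []
    depth = 0
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # A top-level pipe ends the current parameter.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    name = parts[0].strip()
    params = {}
    positional = 0
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
        else:
            # Unnamed parameters are numbered 1, 2, ... as strings.
            positional += 1
            params[str(positional)] = part.strip()
    return name, params
```

Values are returned as raw wikitext, so links and nested templates inside a value still need their own cleanup before the data is usable.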

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Robert Ullmann
Hi Hippietrail!

What do you mean by "intractably slow"? Just how fast must it be?

If I do
http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
it says (on one given try) that it was served in 0,047 seconds. How long can it take
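The same request can be assembled programmatically. This is a minimal sketch that only builds the URL and does not fetch it: the helper name `embeddedin_url` is hypothetical, and `format=json` and the optional `eicontinue` paging parameter are additions that do not appear in the URL quoted above.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the message above.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def embeddedin_url(template, limit=100, namespace=0, cont=None):
    """Build a list=embeddedin query URL for pages transcluding `template`."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,       # e.g. "Template:Infobox_Language"
        "eilimit": limit,
        "einamespace": namespace,  # 0 = main (article) namespace
        "format": "json",          # assumption: machine-readable output wanted
    }
    if cont is not None:
        # Continuation value taken from a previous response, if any.
        params["eicontinue"] = cont
    return API_ENDPOINT + "?" + urlencode(params)
```

Note that `urlencode` percent-encodes the colon in the title, so the result contains `eititle=Template%3AInfobox_Language` rather than the literal colon shown in the hand-written URL; the API treats both the same way.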

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Jona Christopher Sahnwaldt
On Fri, Oct 23, 2009 at 08:37, George Herbert wrote: > I wonder if a project to simply mine the whole article contents and > provide a DB of some sort with the articles and infobox contents would > be worthwhile.  Develop a specific parser and generate and publish the > complete set of article-inf

Re: [Wikitech-l] Questions: parser functions, article caching

2009-10-23 Thread Roan Kattouw
2009/10/23 Dmitriy Sintsov : > 2. My extension generates dynamical content. Because of that, I use > $parser->disableCache() in my tag parser hook. But, the dynamical > content is being changed only in two cases: > > a) The user edits the page. In such case, disableCache() is not > required, becaus

Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Roan Kattouw
2009/10/23 Andrew Dunbar : > But my attempts to find such pages using either the Toolserver's > Wikipedia database or the Mediawiki API have not been fruitful. In > particular, SQL queries on the templatelinks table are intractably > slow. Why are there no keys on tl_from or tl_title? > There are:

[Wikitech-l] Questions: parser functions, article caching

2009-10-23 Thread Dmitriy Sintsov
Hi! I've done a significant cleanup and restructuring of my extension's code that I'd like to submit to SVN. Before trying to submit my extension, I'd like to ask two important questions related to Parser and Article cache. 1. I've implemented my own parser function. The description of functi