On Mon, Oct 26, 2009 at 02:55, Andrew Dunbar wrote:
> Have you thought about doing the same for Wiktionary?
Interesting idea. I don't know much about Wiktionary.
Are its pages structured similarly? How difficult would
it be to extract structured data from them? What kind
of data would you expect
2009/10/23 Aryeh Gregor :
> On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar wrote:
>> Yes, I didn't specify tl_namespace
>
> In MySQL that will usually make it impossible to effectively use an
> index on (tl_namespace, tl_title), so it's essential that you specify
> the NS. (Which you should anyway to avoid hitting things like
> [[Tem
2009/10/23 Jona Christopher Sahnwaldt :
> Because of result count restrictions, these queries don't
> return all ISO language codes extracted by DBpedia,
> but I think they give a good impression of the data quality
> and coverage (or sometimes lack thereof):
>
> http://dbpedia.org/sparql?query=sel
On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar wrote:
> Yes, I didn't specify tl_namespace
In MySQL that will usually make it impossible to effectively use an
index on (tl_namespace, tl_title), so it's essential that you specify
the NS. (Which you should anyway to avoid hitting things like
[[Tem
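
For illustration, a minimal sketch of the indexed lookup described
above, assuming a Toolserver-style replica. The host name is a
placeholder and pymysql is just one possible DB-API driver:

import pymysql

# Pinning tl_namespace lets MySQL use the composite
# (tl_namespace, tl_title) index instead of scanning templatelinks.
conn = pymysql.connect(
    host="enwiki-p.db.toolserver.org",  # placeholder replica host
    database="enwiki_p",
    read_default_file="~/.my.cnf",      # credentials from a config file
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT tl_from FROM templatelinks"
        " WHERE tl_namespace = 10"      # 10 = Template: namespace
        " AND tl_title = 'Infobox_Language'"
    )
    page_ids = [row[0] for row in cur.fetchall()]
print(len(page_ids), "pages transclude the template")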
On Fri, Oct 23, 2009 at 7:04 AM, Roan Kattouw wrote:
> 2009/10/23 Robert Rohde :
>> Given the fairly obvious utility for data mining, it might make sense
> for someone to extend the MediaWiki API to generate a list of template
>> calls and the parameters sent in each case.
>>
> We had a discussion about this Tuesday in the tech staff meeting, and
> decided th
Note: the trailing "}" is part of the URL. Some mail readers may
cut it off.
On Fri, Oct 23, 2009 at 18:45, Jona Christopher Sahnwaldt wrote:
> Because of result count restrictions, these queries don't
> return all ISO language codes extracted by DBpedia,
> but I think they give a good impression of the data quality
> and coverage (or sometimes lack thereof):
Because of result count restrictions, these queries don't
return all ISO language codes extracted by DBpedia,
but I think they give a good impression of the data quality
and coverage (or sometimes lack thereof):
http://dbpedia.org/sparql?query=select+distinct+%3Fs%2C+%3Fo+where{%3Fs+%3Chttp%3A%2F%
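
Decoded, that fragment begins "select distinct ?s, ?o where {?s <http:...".
A sketch of reproducing such a request against the endpoint; the
property URI below is an assumption about how DBpedia exposes the
infobox's iso field, not necessarily the one used above:

import json
import urllib.parse
import urllib.request

# Hypothetical reconstruction: list subjects carrying an "iso" property.
query = """
SELECT DISTINCT ?s ?o WHERE {
  ?s <http://dbpedia.org/property/iso> ?o .
} LIMIT 100
"""
url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode({
    "query": query,
    "format": "application/sparql-results+json",
})
with urllib.request.urlopen(url) as resp:
    results = json.load(resp)
for row in results["results"]["bindings"]:
    print(row["s"]["value"], row["o"]["value"])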
2009/10/23 Aryeh Gregor :
> On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar wrote:
>> Yes, I found out how to get it through the API now. It was actually just
>> the Toolserver database that was intractably slow.
>
> There's nothing slow about the TS database here:
>
> mysql> pager true
> PAGER set to 'true'
> mysql> SELECT tl_from FROM templ
2009/10/23 William Pietri :
> George Herbert wrote:
>> This discussion brings to mind several historical threads.
>> I wonder if a project to simply mine the whole article contents and
>> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...
Fascinating!
It seems to be a repeating pattern on these mailing lists that people
ignore existing solutions and discuss re-inventing wheels (please
correct me if I'm wrong here).
While I agree this is fun for some, it rarely helps the OP...
[[User:Dschwen]]
On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar wrote:
> Yes, I found out how to get it through the API now. It was actually just
> the Toolserver database that was intractably slow.
There's nothing slow about the TS database here:
mysql> pager true
PAGER set to 'true'
mysql> SELECT tl_from FROM templ
Robert Ullmann wrote:
>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>>
>
> Hi,
2009/10/23 Robert Rohde :
> Given the fairly obvious utility for data mining, it might make sense
> for someone to extend the MediaWiki API to generate a list of template
> calls and the parameters sent in each case.
>
We had a discussion about this Tuesday in the tech staff meeting, and
decided th
2009/10/23 Robert Ullmann :
>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>
> Hi,
Hi,
Come now, you are over-thinking it. F
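
Over-thought or not, the depth counting itself stays short. A sketch
(not a full wikitext parser; it ignores <nowiki> and comments) that
extracts the first {{Infobox Language ...}} blob by tracking {{ / }}
nesting:

import re

def extract_template(wikitext, name="Infobox Language"):
    """Return the first {{name ...}} block, or None."""
    m = re.search(r"\{\{\s*" + re.escape(name), wikitext, re.IGNORECASE)
    if not m:
        return None
    depth, i = 0, m.start()
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            depth += 1
            i += 2
        elif wikitext.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[m.start():i]
        else:
            i += 1
    return None  # unbalanced braces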
I am so glad that someone re-re-resurrects this topic :-)
On Fri, Oct 23, 2009 at 1:27 PM, Andrew Dunbar wrote:
> I've been spending hours on the parsing now and don't find it simple
> at all due to the fact that templates can be nested. Just extracting
> the Infobox as one big lump is hard due to the need to match nested {{
> and }}
Given the fairly obvious utility for data mining, it might make sense
for someone to extend the MediaWiki API to generate a list of template
calls and the parameters sent in each case.
-Robert Rohde
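
Until the API grows such a feature, a client can approximate it by
fetching each page's raw wikitext with the existing prop=revisions
module and parsing the templates locally. A minimal sketch, error
handling omitted:

import json
import urllib.parse
import urllib.request

API = "http://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    # prop=revisions with rvprop=content returns the latest revision text.
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]  # the raw wikitext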
2009/10/23 Robert Ullmann :
> Hi Hippietrail!
>
> What do you mean by "intractably slow"? Just how fast must it be?
>
> If I do
> http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
> it says (on one given try) that it was served in 0.047 seconds. How
> long can it take
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...
That is a brilliant idea...
Hi Hippietrail!
What do you mean by "intractably slow"? Just how fast must it be?
If I do
http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
it says (on one given try) that it was served in 0.047 seconds. How
long can it take
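
The 0.047 s is for a single batch of 100; enumerating every
transclusion means following the continuation parameter across
requests. A sketch of paging list=embeddedin (today the continuation
comes back under "continue"; in the 2009 API it was "query-continue"):

import json
import urllib.parse
import urllib.request

API = "http://en.wikipedia.org/w/api.php"

def embedded_in(template="Template:Infobox_Language", namespace=0):
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "einamespace": namespace,
        "eilimit": "500",
        "format": "json",
    }
    while True:
        url = API + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for page in data["query"]["embeddedin"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries eicontinue forward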
On Fri, Oct 23, 2009 at 08:37, George Herbert wrote:
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...
2009/10/23 Andrew Dunbar :
> But my attempts to find such pages using either the Toolserver's
> Wikipedia database or the MediaWiki API have not been fruitful. In
> particular, SQL queries on the templatelinks table are intractably
> slow. Why are there no keys on tl_from or tl_title?
>
There are:
George Herbert wrote:
> This discussion brings to mind several historical threads.
>
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...
This discussion brings to mind several historical threads.
I wonder if a project to simply mine the whole article contents and
provide a DB of some sort with the articles and infobox contents would
be worthwhile. Develop a specific parser and generate and publish the
complete set of article-infobox-(key-value) sets...
2009/10/22 Daniel Schwen :
>> particular, SQL queries on the templatelinks table are intractably
>> slow. Why are there no keys on tl_from or tl_title?
>
> How are you planning to get the template parameters? Have I missed a
> recent schema change?
I've been trying to parse the wikitext of section
> particular, SQL queries on the templatelinks table are intractably
> slow. Why are there no keys on tl_from or tl_title?
How are you planning to get the template parameters? Have I missed a
recent schema change?
I'd be interested in following your progress. I'm not extracting
infobox data, but p
Infoboxes in Wikipedia often contain information which is quite useful
outside Wikipedia but can be surprisingly difficult to data-mine.
I would like to find all Wikipedia pages that use
Template:Infobox_Language and parse the parameters iso3 and
fam1...fam15
But my attempts to find such pages using either the Toolserver's
Wikipedia database or the MediaWiki API have not been fruitful. In
particular, SQL queries on the templatelinks table are intractably
slow. Why are there no keys on tl_from or tl_title?
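
Once a page's infobox blob has been cut out (see the brace-matching
sketch above), pulling the parameters named here (iso3, fam1...fam15)
is a matter of splitting on "|" only at nesting depth zero, so pipes
inside nested templates or links don't break fields. A sketch:

def template_params(blob):
    """Map parameter names to raw values for one {{...}} blob."""
    inner = blob.strip()[2:-2]  # drop the outer {{ and }}
    parts, cur, depth = [], [], 0
    for ch in inner:
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        if ch == "|" and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur))
    params = {}
    for part in parts[1:]:  # parts[0] is the template name
        if "=" in part:
            key, _, value = part.partition("=")
            params[key.strip()] = value.strip()
    return params

# usage: p = template_params(blob)
# p.get("iso3"); [p.get("fam%d" % i) for i in range(1, 16)]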