While we (Research & Data @ WMF) consider maintaining a standard dump of categories and pagelinks, anyone can pull such a dataset from the Tool Labs database replicas (see http://tools.wmflabs.org) with the following two queries.
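If you'd rather run these from a script than the mysql command line, here is a minimal sketch of streaming a query result to TSV. It assumes a Tool Labs account with replica credentials in ~/replica.my.cnf and the pymysql package; the enwiki.labsdb host and enwiki_p database names are for English Wikipedia and would need adjusting for other wikis. The server-side cursor matters because, as noted below, the result sets are huge:

    #!/usr/bin/env python
    """Stream a large replica-DB query to TSV on stdout.

    Usage (on Tool Labs): python dump_tsv.py < query.sql > output.tsv
    """
    import csv
    import os
    import sys

    import pymysql  # assumed installed; any driver with a server-side cursor works

    conn = pymysql.connect(
        host="enwiki.labsdb",  # English Wikipedia replica; adjust per wiki
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),  # Tool Labs credentials
        cursorclass=pymysql.cursors.SSCursor,  # server-side cursor: don't buffer the full result
    )

    writer = csv.writer(sys.stdout, delimiter="\t")
    with conn.cursor() as cursor:
        cursor.execute(sys.stdin.read())  # pipe in one of the two queries below
        for row in cursor:  # rows stream one at a time; NULLs become empty fields
            writer.writerow(row)
    conn.close()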
/* Get all page links */
SELECT
    origin.page_id AS from_id,
    origin.page_namespace AS from_namespace,
    origin.page_title AS from_title,
    dest.page_id AS to_id, /* NULL if the page doesn't exist */
    pl_namespace AS to_namespace,
    pl_title AS to_title
FROM pagelinks
LEFT JOIN page origin ON origin.page_id = pl_from
LEFT JOIN page dest ON dest.page_namespace = pl_namespace
                   AND dest.page_title = pl_title;

/* Get all category links */
SELECT
    origin.page_id AS from_id,
    origin.page_namespace AS from_namespace,
    origin.page_title AS from_title,
    cl_to AS category_title
FROM categorylinks
LEFT JOIN page origin ON origin.page_id = cl_from;

Note that these tables are very large. For English Wikipedia, pagelinks contains ~900 million rows and categorylinks contains ~66 million rows.

-Aaron

On Mon, Dec 16, 2013 at 11:28 AM, Giovanni Luca Ciampaglia <glciamp...@gmail.com> wrote:
> +1
>
> Same here. Also, using a standardized dataset would make it much easier to
> reproduce others' work.
>
> G
>
> On Sun 15 Dec 2013 05:19:54 AM EST, Carlos Castillo wrote:
>> Hi,
>>
>> I think this is definitely a great idea which will save lots of
>> researchers a ton of work.
>>
>> Cheers,
>
> --
> Giovanni Luca Ciampaglia
>
> Postdoctoral fellow
> Center for Complex Networks and Systems Research
> Indiana University
>
> ✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
> ☞ http://cnets.indiana.edu/
> ✉ gciam...@indiana.edu
> ✆ 1-812-855-7261