(cross-posting Sebastiano’s post from the analytics list, as it may be of 
interest to both the wikidata and wiki-research-l communities)

Begin forwarded message:

> From: Sebastiano Vigna <vi...@di.unimi.it>
> Subject: [Analytics] Distributing an official graph
> Date: December 9, 2013 at 10:09:31 PM PST
> 
> [Reposted from private discussion after Dario's request]
> 
> My problem is that of exploring the graph structure of Wikipedia:
> 
> 1) easily;
> 2) reproducibly;
> 3) in a way that does not depend on parsing artifacts.
> 
> Presently, when people want to do this they either do their own parsing of 
> the dumps, use the SQL data, or download a dataset like
> 
> http://law.di.unimi.it/webdata/enwiki-2013/
> 
> which has everything "cooked up".
> 
> My frustration in the last few days came when trying to add the category 
> links. I didn't realize (well, it's not very well documented) that bliki 
> extracts all links and renders them in HTML *except* for the category links, 
> which are instead accessible programmatically. Once I got there, I was able 
> to make some progress.
> 
> Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and 
> category links) is really a mine of information, and it is a pity that so 
> much huffing and puffing is necessary to do something as simple as a reverse 
> visit of the category links from "People" to get all people pages (in 
> practice this is a bit more complicated--there are many false positives--but 
> after a couple of fixes it worked quite well).
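> 
> A minimal sketch of such a reverse visit, assuming the transposed 
> category-link graph were distributed in WebGraph format (the basename 
> "enwiki-categories-t" and the "People" node ID below are hypothetical):
> 
>     import it.unimi.dsi.webgraph.ImmutableGraph;
>     import it.unimi.dsi.webgraph.LazyIntIterator;
>     import java.util.ArrayDeque;
> 
>     public class ReverseCategoryVisit {
>         public static void main(String[] args) throws Exception {
>             // Transposed category-link graph, e.g. produced once with
>             // WebGraph's Transform.transpose; the basename is hypothetical.
>             ImmutableGraph reversed = ImmutableGraph.load("enwiki-categories-t");
>             int peopleNode = Integer.parseInt(args[0]); // node ID of "People"
>             boolean[] seen = new boolean[reversed.numNodes()];
>             ArrayDeque<Integer> queue = new ArrayDeque<>();
>             queue.add(peopleNode);
>             seen[peopleNode] = true;
>             int reached = 0;
>             // Breadth-first visit: successors in the transposed graph are
>             // the pages and subcategories pointing to the current category.
>             while (!queue.isEmpty()) {
>                 int node = queue.remove();
>                 LazyIntIterator it = reversed.successors(node);
>                 for (int s; (s = it.nextInt()) != -1; )
>                     if (!seen[s]) { seen[s] = true; reached++; queue.add(s); }
>             }
>             System.out.println(reached + " pages/categories reached from People");
>         }
>     }
> 
> (As noted above, in practice one would still have to filter out false 
> positives.)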
> 
> Moreover, one continuously has the feeling of walking on eggshells: a small 
> change in bliki, a small change in the XML format, and everything might stop 
> working in such a subtle manner that you realize it only after a long time.
> 
> I was wondering if Wikimedia would be interested in distributing the 
> Wikipedia graph in compressed form. That would be the "official" Wikipedia 
> graph--the benefits, in particular for people working on leveraging semantic 
> information from Wikipedia, would be really significant.
> 
> I would (obviously) propose to use our Java framework, WebGraph, which is 
> actually quite standard for distributing large (well, actually much larger) 
> graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12 
> http://lemurproject.org/clueweb12/ and the recent Common Crawl hyperlink graph 
> http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK, 
> even a pair of integers per line. The advantages of a binary compressed form 
> are reduced network utilization, instantaneous availability of the 
> information, etc.
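> 
> For instance, a minimal sketch of what consuming such a distribution could 
> look like with WebGraph (the basename "enwiki" below is hypothetical):
> 
>     import it.unimi.dsi.webgraph.ImmutableGraph;
>     import it.unimi.dsi.webgraph.LazyIntIterator;
> 
>     public class SuccessorsExample {
>         public static void main(String[] args) throws Exception {
>             // Memory-map the compressed graph: no parsing, no unpacking pass.
>             ImmutableGraph graph = ImmutableGraph.loadMapped("enwiki");
>             int node = Integer.parseInt(args[0]);
>             System.out.println("outdegree: " + graph.outdegree(node));
>             // Enumerate the successors, i.e., the pages this page links to.
>             LazyIntIterator successors = graph.successors(node);
>             for (int s; (s = successors.nextInt()) != -1; )
>                 System.out.println(node + " -> " + s);
>         }
>     }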
> 
> Probably it would be useful to actually distribute several graphs over the 
> same dataset--e.g., the category links, the content links, etc. Using 
> WebGraph, it is immediate to build a union (i.e., a superposition) of any 
> set of such graphs and use it transparently as a single graph.
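> 
> For instance, assuming hypothetical basenames "enwiki-links" and 
> "enwiki-categories" for the two graphs, the superposition would be just:
> 
>     import it.unimi.dsi.webgraph.ImmutableGraph;
>     import it.unimi.dsi.webgraph.Transform;
> 
>     public class UnionExample {
>         public static void main(String[] args) throws Exception {
>             ImmutableGraph links = ImmutableGraph.load("enwiki-links");
>             ImmutableGraph categories = ImmutableGraph.load("enwiki-categories");
>             // The union has an arc x -> y iff either input graph has it,
>             // and can be used exactly like any other ImmutableGraph.
>             ImmutableGraph both = Transform.union(links, categories);
>             System.out.println(both.numNodes() + " nodes");
>         }
>     }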
> 
> In my mind the distributed graph should have a contiguous ID space, say, 
> induced by the lexicographical order of the titles (possibly placing template 
> pages at the start or at the end of the ID space). We should provide the 
> graphs and a bidirectional node<->title map. All such information would use 
> about 300 MB of space for the current English Wikipedia. People could then 
> associate pages with nodes using the title as a key.
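> 
> To make the last point concrete, a toy sketch of such a bidirectional map 
> (in practice one would use something compressed, e.g. a front-coded list of 
> titles, but a sorted array already shows the idea):
> 
>     import java.util.Arrays;
> 
>     public class TitleMap {
>         private final String[] sortedTitles; // index == node ID
> 
>         public TitleMap(String[] sortedTitles) {
>             this.sortedTitles = sortedTitles;
>         }
> 
>         // node -> title: plain array indexing.
>         public String title(int node) {
>             return sortedTitles[node];
>         }
> 
>         // title -> node: binary search in the lexicographically sorted
>         // titles; returns -1 if the title is not present.
>         public int node(String title) {
>             int i = Arrays.binarySearch(sortedTitles, title);
>             return i >= 0 ? i : -1;
>         }
>     }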
> 
> But this last part is just rambling. :)
> 
> Let me know if you people are interested. We can of course take care of the 
> process of cooking up the information once it is out of the SQL database.
> 
> Ciao,
> 
>                                       seba
> 
> 

