[Wiki-research-l] Distributing the Wikipedia category/pagelink graph

2013-12-10 Thread Dario Taraborelli
(Cross-posting Sebastiano’s post from the analytics list; this may be of 
interest to both the wikidata and wiki-research-l communities.)

Begin forwarded message:

> From: Sebastiano Vigna 
> Subject: [Analytics] Distributing an official graph
> Date: December 9, 2013 at 10:09:31 PM PST
> 
> [Reposted from private discussion after Dario's request]
> 
> My problem is that of exploring the graph structure of Wikipedia
> 
> 1) easily;
> 2) reproducibly;
> 3) in a way that does not depend on parsing artifacts.
> 
> Presently, when people want to do this they either do their own parsing of 
> the dumps, or they use the SQL data, or they download a dataset like
> 
> http://law.di.unimi.it/webdata/enwiki-2013/
> 
> which has everything "cooked up".
> 
> My frustration in the last few days came when trying to add the category 
> links. I didn't realize (well, it's not well documented) that bliki extracts 
> all links and renders them in HTML *except* for the category links, which are 
> instead accessible programmatically. Once I got there, I was able to make 
> some progress.
> 
> Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and 
> category links) is really a mine of information and it is a pity that a lot 
> of huffing and puffing is necessary to do something as simple as a reverse 
> visit of the category links from "People" to get, actually, all people pages 
> (this is a bit more complicated--there are many false positives, but after a 
> couple of fixes worked quite well).
> 
> Moreover, one continuously has the feeling of walking on eggshells: a small 
> change in bliki, a small change in the XML format, and everything might stop 
> working in such a subtle manner that you realize it only after a long time.
> 
> I was wondering if Wikimedia would be interested in distributing in 
> compressed form the Wikipedia graph. That would be the "official" Wikipedia 
> graph--the benefits, in particular for people working on leveraging semantic 
> information from Wikipedia, would be really significant.
> 
> I would (obviously) propose to use our Java framework, WebGraph, which is 
> actually quite standard in distributing large (well, actually much larger) 
> graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12 
> http://lemurproject.org/clueweb12/ and the recent Common Web Crawl 
> http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK, 
> even a pair of integers per line. The advantage of a binary compressed form 
> is reduced network utilization, instantaneous availability of the 
> information, etc.
> 
> Probably it would be useful to actually distribute several graphs with the 
> same dataset--e.g., the category links, the content links, etc. It is 
> straightforward, using WebGraph, to build a union (i.e., a superposition) of 
> any set of such graphs and use it transparently as a single graph.
> 
> In my mind the distributed graph should have a contiguous ID space, say, 
> induced by the lexicographical order of the titles (possibly placing template 
> pages at the start or at the end of the ID space). We should provide graphs, 
> and a bidirectional node<->title map. All such information would use about 
> 300M of space for the current English Wikipedia. People could then associate 
> pages to nodes using the title as a key.
> 
> But this last part is just rambling. :)
> 
> Let me know if you people are interested. We can of course take care of the 
> process of cooking up the information once it is out of the SQL database.
> 
> Ciao,
> 
>   seba
> 
> 
> ___
> Analytics mailing list
> analyt...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
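
A minimal sketch of the ID scheme proposed in the message above: contiguous
node IDs induced by the sorted order of the titles, a bidirectional
node<->title map, and a union (superposition) of several edge lists over the
same ID space. This is plain Python rather than WebGraph, and everything named
here (build_id_map, union_graph, the toy titles and links) is a hypothetical
illustration, not part of any official dataset or library. Sorting uses
Python's default string order, which is only one possible reading of
"lexicographical order".

```python
def build_id_map(titles):
    """Assign contiguous IDs 0..n-1 following the sorted order of titles."""
    id_to_title = sorted(set(titles))  # node ID -> title (list index is the ID)
    title_to_id = {title: node for node, title in enumerate(id_to_title)}
    return id_to_title, title_to_id


def union_graph(*edge_lists):
    """Superpose edge lists that share one ID space into a successor map."""
    successors = {}
    for edges in edge_lists:
        for src, dst in edges:
            successors.setdefault(src, set()).add(dst)
    return {node: sorted(targets) for node, targets in successors.items()}


# Hypothetical toy data, not taken from any real dump.
titles = ["Category:People", "Ada Lovelace", "Category:Mathematicians", "Alan Turing"]
id_to_title, title_to_id = build_id_map(titles)


def edge(src_title, dst_title):
    """Translate a (title, title) pair into a (node ID, node ID) pair."""
    return title_to_id[src_title], title_to_id[dst_title]


# Assumed edge direction: member page or subcategory -> containing category.
category_links = [
    edge("Ada Lovelace", "Category:Mathematicians"),
    edge("Alan Turing", "Category:Mathematicians"),
    edge("Category:Mathematicians", "Category:People"),
]
content_links = [edge("Ada Lovelace", "Alan Turing")]

# One graph that transparently contains both kinds of links.
graph = union_graph(category_links, content_links)
```

Because both edge lists use the same title-derived IDs, they can be merged
without any renumbering, which is the property the proposal relies on.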


___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
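
The "reverse visit of the category links" mentioned in Sebastiano's message
can be sketched as a breadth-first visit of the transposed category graph,
read from the simplest format he suggests (a pair of integers per line). This
is an illustrative sketch only: the function names, the toy edge lines and the
edge direction (member -> category) are assumptions, and it does nothing about
the false positives the message warns about.

```python
from collections import deque


def parse_edges(lines):
    """Parse 'src dst' integer pairs, one per line, skipping blank lines."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        src, dst = line.split()
        edges.append((int(src), int(dst)))
    return edges


def reverse_visit(edges, root):
    """Return every node that can reach `root`, via BFS on the reversed edges."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(dst, []).append(src)
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for pred in predecessors.get(node, ()):
            if pred not in seen:  # also protects against category cycles
                seen.add(pred)
                queue.append(pred)
    seen.discard(root)
    return seen


# Hypothetical toy graph: nodes 0 and 1 are pages in category 2, category 2 is
# a subcategory of node 3 (standing in for "People"); 4 -> 5 is unrelated.
edges = parse_edges(["0 2", "1 2", "2 3", "", "4 5"])
under_people = reverse_visit(edges, 3)
```

The result mixes pages and intermediate categories; separating the two would
need extra information such as the namespace of each node.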


Re: [Wiki-research-l] Fwd: Fwd: Data Science for Social Good Summer Fellowship

2013-12-10 Thread Giovanni Luca Ciampaglia
Hi Ahmed, I am not connected with them. You'd better ask Mr. Ghani 
directly.


Best

Giovanni

On Tue 10 Dec 2013 11:56:35 AM EST, Ahmed Aley wrote:

> Hi,
>
> I know a couple of students who might be interested but unfortunately they are
> all doing their studies in Europe; will that be an issue?
>
> Best,
> --Ahmed



--
Giovanni Luca Ciampaglia

Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University

✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciam...@indiana.edu
✆ 1-812-855-7261




Re: [Wiki-research-l] Fwd: Fwd: Data Science for Social Good Summer Fellowship

2013-12-10 Thread Ahmed Aley
Hi,

I know a couple of students who might be interested but unfortunately they
are all doing their studies in Europe; will that be an issue?

Best,
--Ahmed




[Wiki-research-l] Fwd: Fwd: Data Science for Social Good Summer Fellowship

2013-12-10 Thread Giovanni Luca Ciampaglia

Not wiki-related per se, but probably many people on this list might be 
interested.

G



*From: *Rayid Ghani <ra...@uchicago.edu>
*Subject: **Data Science for Social Good Summer Fellowship*
*Date: *December 9, 2013 3:00:10 PM EST
*To: *Rayid Ghani <ra...@uchicago.edu>

Hi,
I'm running the Eric & Wendy Schmidt "Data Science for Social Good" Summer 
Fellowship again this year at the University of Chicago and need help in 
recruiting strong students (grad students or junior/senior undergrads with 
CS, Machine Learning, and/or Stats background). The goal is to get up to 
50 students in Chicago this summer and have them work on high-impact 
social problems (in education, healthcare, energy, transportation, crime, 
etc.) using Machine Learning, Data Mining, and other related buzzwords. The 
students will work with full-time mentors from academia and industry. The 
fellowships are paid competitively and we will provide housing as well.


More details are at http://dssg.uchicago.edu . 
Applications for the fellowship are due February 1, 2014.


If you have (or know of) strong CS/Stats/Econometrics/Applied Math/Policy 
students who have an interest in making an impact by working on high-impact 
social problems using machine learning/data mining/stats, please forward this 
to them.


Thanks,
Rayid

P.S. We’re also looking for full-time mentors (strong technical folks with 
real-world experience who want to spend the summer in Chicago working with a 
team of fellows).


Rayid Ghani
Computation Institute & Harris School of Public Policy
University of Chicago
ra...@uchicago.edu 
http://www.rayidghani.com 




