Re: [Analytics] Completeness of Wikipedia Clickstream dataset

2019-05-13 Thread Joseph Allemandou
Adding Simon back, as he might not be on the list. On Mon, May 13, 2019 at 5:58 PM Joseph Allemandou wrote: > Hi Simon, > Thanks for reaching out :) > > I tried a similar analysis on our cluster with the same original files as > the ones in dumps.wikimedia.org, using Spark to speed up

Re: [Analytics] Completeness of Wikipedia Clickstream dataset

2019-05-13 Thread Joseph Allemandou
Hi Simon, Thanks for reaching out :) I tried a similar analysis on our cluster with the same original files as the ones in dumps.wikimedia.org, using Spark to speed up computation. I ended up with coherent results for both the examples you gave: Sum - count Data date Climate_change -->

Re: [Analytics] Completeness of Wikipedia Clickstream dataset

2019-05-13 Thread Isaac Johnson
Hey Simon, Is it possible that there is an issue in how you are aggregating the data? I downloaded the dump for 2018-10 (https://dumps.wikimedia.org/other/clickstream/2018-10/clickstream-enwiki-2018-10.tsv.gz) and found the following lines: Global_warming Climate_change link 3048
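
For readers who want to check the aggregation themselves, here is a minimal sketch of summing the counts for one article pair in a clickstream dump. It assumes each row has four tab-separated fields (previous page, current page, link type, count), which matches the line quoted above. The function names `pair_total` and `pair_total_from_dump` are made up for this illustration, and only the first sample row comes from the thread; the other two are invented.

```python
import gzip
from typing import Iterable


def pair_total(rows: Iterable[str], a: str, b: str) -> int:
    """Sum the counts of all rows that connect a and b, in either direction."""
    total = 0
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not have the assumed four fields
        prev, curr, _link_type, n = fields
        if {prev, curr} == {a, b}:
            total += int(n)
    return total


def pair_total_from_dump(path: str, a: str, b: str) -> int:
    """Same aggregation, reading a gzipped monthly dump from a local path."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return pair_total(fh, a, b)


sample = [
    "Global_warming\tClimate_change\tlink\t3048",  # row quoted in the thread
    "Climate_change\tGlobal_warming\tlink\t10",    # invented row
    "other-search\tClimate_change\texternal\t5",   # invented row
]
print(pair_total(sample, "Climate_change", "Global_warming"))
```

Splitting on tabs rather than on arbitrary whitespace keeps each field intact, which is one place where a filter can silently drop or mismatch rows.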

[Analytics] Completeness of Wikipedia Clickstream dataset

2019-05-13 Thread Simon Munzert
Hi all, I've got a question on the completeness of the clickstream dataset. I downloaded the dumps for 2018 from https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only). When I filter for the article pair "Climate change" and "Global warming" (either one being either prev or

Re: [Analytics] WMF API update

2019-05-13 Thread Celeste A Manughian-Peter
Dan, Apologies for the late reply. I understand, we're mostly just interested in en.wikipedia, not segmented by any year. Our work involves building a tool that analyzes trending technologies, with Wikipedia as one source of data. Best, Celeste From: Dan