Adding Simon back, as he might not be on the list.
On Mon, May 13, 2019 at 5:58 PM Joseph Allemandou
wrote:
> Hi Simon,
> Thanks for reaching out :)
>
> I tried a similar analysis on our cluster with the same original files as
> the ones in dumps.wikimedia.org, using Spark to speed up
Hi Simon,
Thanks for reaching out :)
I tried a similar analysis on our cluster with the same original files as
the ones in dumps.wikimedia.org, using Spark to speed up computation.
I ended up with coherent results for both the examples you gave:
Sum - count Data
date Climate_change -->
Hey Simon,
Is it possible that there is an issue in how you are aggregating the data?
I downloaded the dump for 2018-10 (
https://dumps.wikimedia.org/other/clickstream/2018-10/clickstream-enwiki-2018-10.tsv.gz)
and found the following lines:
Global_warming	Climate_change	link	3048
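In case it helps to compare aggregation steps, here is a minimal sketch of how the pair could be filtered and summed in Python. It assumes each row of the dump has four tab-separated fields (prev, curr, type, n), which matches the line quoted above; the function names and the PAIR constant are made up for illustration and are not from anyone's actual analysis.

```python
import csv
import gzip
from collections import defaultdict

# Hypothetical constant: the two article titles discussed in this thread.
PAIR = {"Climate_change", "Global_warming"}


def pair_counts(lines):
    """Sum the click counts per (prev, curr) for rows linking the two PAIR articles.

    Assumes four tab-separated fields per row: prev, curr, type, n.
    """
    totals = defaultdict(int)
    # QUOTE_NONE so that quote characters inside article titles are kept literally.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that do not match the assumed layout
        prev, curr, _link_type, n = row
        if prev in PAIR and curr in PAIR and prev != curr:
            totals[(prev, curr)] += int(n)
    return dict(totals)


def pair_counts_from_dump(path):
    """Run pair_counts over a gzipped monthly dump stored at a local path."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return pair_counts(handle)
```

Called on the single line quoted above, `pair_counts(["Global_warming\tClimate_change\tlink\t3048"])` returns `{("Global_warming", "Climate_change"): 3048}`. Summing per direction rather than per unordered pair keeps the two directions separate, which makes it easier to see whether one of them is missing for a given month.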
Hi all,
I've got a question on the completeness of the clickstream dataset. I
downloaded the dumps for 2018 from
https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only). When I
filter for the article pair "Climate change" and "Global warming" (either one
being either prev or
Dan,
Apologies for the late reply. I understand. We're mostly interested in
en.wikipedia, not segmented by year. Our work involves building a tool that
analyzes trending technologies, with Wikipedia as one source of data.
Best,
Celeste
From: Dan