Adding Simon back, as he might not be on the list.

On Mon, May 13, 2019 at 5:58 PM Joseph Allemandou <jalleman...@wikimedia.org> wrote:

> Hi Simon,
> Thanks for reaching out :)
>
> I tried a similar analysis on our cluster with the same original files as
> the ones in dumps.wikimedia.org, using Spark to speed up computation.
> I ended up with coherent results for both the examples you gave:
>
> Sum - count Data
>
> date     Climate_change --> Global_warming  Global_warming --> Climate_change  *Total Result*
> 2017-11  3904                               950                                *4854*
> 2017-12  3549                               780                                *4329*
> 2018-01  4508                               1011                               *5519*
> 2018-02  3548                               998                                *4546*
> 2018-03  3462                               745                                *4207*
> 2018-04  3726                               755                                *4481*
> 2018-05  3730                               810                                *4540*
> 2018-06  2971                               862                                *3833*
> 2018-07  3500                               1602                               *5102*
> 2018-08  4546                               1644                               *6190*
> 2018-09  3962                               1472                               *5434*
> 2018-10  6155                               3048                               *9203*
> 2018-11  5865                               2617                               *8482*
> 2018-12  5491                               2227                               *7718*
> 2019-01  5774                               2911                               *8685*
> 2019-02  6311                               2845                               *9156*
> 2019-03  6858                               2514                               *9372*
> 2019-04  6824                               2199                               *9023*
>
>
> Sum - count Data
>
> date            Air_pollution --> Smog  Smog --> Air_pollution  *Total Result*
> 2017-11         82                      263                     *345*
> 2017-12         200                     184                     *384*
> 2018-01         65                      140                     *205*
> 2018-02         82                      98                      *180*
> 2018-03         418                     149                     *567*
> 2018-04         295                     137                     *432*
> 2018-05         215                     95                      *310*
> 2018-06         245                     85                      *330*
> 2018-07         233                     70                      *303*
> 2018-08         36                      62                      *98*
> 2018-09         45                      81                      *126*
> 2018-10         66                      96                      *162*
> 2018-11         128                     135                     *263*
> 2018-12         50                      90                      *140*
> 2019-01         68                      92                      *160*
> 2019-02         50                      68                      *118*
> 2019-03         49                      72                      *121*
> 2019-04         33                      51                      *84*
> *Total Result*  *2360*                  *1968*                  *4328*
>
> Maybe there is an issue in the way you process the data?
> Best
> Joseph
>
>
> On Mon, May 13, 2019 at 3:38 PM Simon Munzert <
> simon.munz...@googlemail.com> wrote:
>
>> Hi all,
>>
>> I've got a question on the completeness of the clickstream dataset. I
>> downloaded the dumps for 2018 from
>> https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only).
>> When I filter for the article pair "Climate change" and "Global warming"
>> (either one being either prev or curr) for all of 2018, this is what I get:
>>
>>   prev           curr           type      n month
>>   <chr>          <chr>          <chr> <dbl> <chr>
>> 1 Global_warming Climate_change link    755 2018-04
>> 2 Global_warming Climate_change link    810 2018-05
>> 3 Climate_change Global_warming link   3730 2018-05
>> 4 Climate_change Global_warming link   3962 2018-09
>> 5 Climate_change Global_warming link   5865 2018-11
>> 6 Climate_change Global_warming link   5491 2018-12
>> 7 Global_warming Climate_change link   2227 2018-12
>>
>> The visit numbers seem plausible. But why is there no data on, e.g.,
>> January to March? And why is there data for both directions in May and
>> December, but not for the others? This seems implausible given the
>> popularity of the articles.
>>
>> Here's another example:
>>
>>   prev          curr          type      n month
>>   <chr>         <chr>         <chr> <dbl> <chr>
>> 1 Smog          Air_pollution link    140 2018-01
>> 2 Air_pollution Smog          link     82 2018-02
>> 3 Air_pollution Smog          link    295 2018-04
>> 4 Air_pollution Smog          link    215 2018-05
>> 5 Smog          Air_pollution link     85 2018-06
>> 6 Air_pollution Smog          link    233 2018-07
>> 7 Air_pollution Smog          link     45 2018-09
>> 8 Smog          Air_pollution link     96 2018-10
>> 9 Smog          Air_pollution link     90 2018-12
>>
>> Am I missing something here?
>>
>> Thanks in advance,
>> Simon
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
> --
> Joseph Allemandou (joal) (he / him)
> Sr Data Engineer
> Wikimedia Foundation
>

--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
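
One thing that may be worth ruling out on the processing side (this is an assumption; nothing in the thread establishes the cause): a parser that treats quote characters inside article titles as field quoting can silently merge or drop rows, which could leave gaps like the ones reported above. The sketch below uses plain Python rather than Spark or R and filters a tab-separated stream for one article pair with quote handling disabled. The function name pair_counts, the made-up row with an unbalanced quote, and the assumption of four headerless columns (prev, curr, type, n) are illustrative only.

```python
import csv
import io
from collections import defaultdict


def pair_counts(lines, month, a, b, totals=None):
    """Add up clicks between articles a and b (both directions) for one month."""
    if totals is None:
        totals = defaultdict(int)
    # QUOTE_NONE treats quote characters in titles as ordinary text, so an
    # unbalanced '"' cannot absorb the rows that follow it.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed lines instead of guessing
        prev, curr, _link_type, n = row
        if {prev, curr} == {a, b}:
            totals[(month, prev, curr)] += int(n)
    return totals


# Two counts taken from the 2018-01 row of the first table in this thread,
# plus one made-up row whose title starts with an unbalanced quote.
sample = (
    "Climate_change\tGlobal_warming\tlink\t4508\n"
    '"Hypothetical_title\tSmog\tlink\t12\n'
    "Global_warming\tClimate_change\tlink\t1011\n"
)

totals = pair_counts(io.StringIO(sample), "2018-01",
                     "Climate_change", "Global_warming")
for key, n in sorted(totals.items()):
    print(*key, n)
```

For a real monthly file, the same function could be fed from gzip.open(path, "rt", encoding="utf-8", newline=""), passing the same totals dictionary across months so that a missing month or direction shows up as an absent key.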
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics