Adding Simon back, as he might not be on the list.
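
In case it helps with checking the processing step, here is a minimal Python sketch of the pair filter discussed below. It assumes the monthly dump files are gzipped, tab-separated, header-less, and have the four columns prev, curr, type, n, as in Simon's output. The function names (`parse_clickstream`, `filter_pair`, `read_month`) are made up for illustration. One thing worth ruling out is the reader's quote handling: a TSV reader that treats `"` in a title as a quoting character can merge or drop rows. That is only a guess, nothing in this thread confirms it as the cause, so the sketch simply disables quoting.

```python
import csv
import gzip
from typing import Iterable, Iterator, List, NamedTuple


class Click(NamedTuple):
    prev: str
    curr: str
    link_type: str
    n: int


def parse_clickstream(lines: Iterable[str]) -> Iterator[Click]:
    """Parse tab-separated clickstream lines (assumed: 4 columns, no header)."""
    # QUOTE_NONE keeps quote characters inside titles as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        prev, curr, link_type, n = row
        yield Click(prev, curr, link_type, int(n))


def filter_pair(rows: Iterable[Click], a: str, b: str) -> List[Click]:
    """Keep rows linking a -> b or b -> a."""
    wanted = {(a, b), (b, a)}
    return [r for r in rows if (r.prev, r.curr) in wanted]


def read_month(path: str) -> Iterator[Click]:
    """Stream one monthly dump file (path is supplied by the caller)."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        yield from parse_clickstream(fh)


# Small in-memory check; the first two counts are the 2018-05 values
# quoted in this thread, the third row is a made-up title with quotes.
sample = [
    "Climate_change\tGlobal_warming\tlink\t3730\n",
    "Global_warming\tClimate_change\tlink\t810\n",
    "\"Example\"_(title)\tSmog\tlink\t5\n",
]
hits = filter_pair(parse_clickstream(sample), "Climate_change", "Global_warming")
```

Running `filter_pair(read_month(path), ...)` once per monthly file and counting the rows returned would show whether any month loses one of the two directions during parsing.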

On Mon, May 13, 2019 at 5:58 PM Joseph Allemandou <jalleman...@wikimedia.org>
wrote:

> Hi Simon,
> Thanks for reaching out :)
>
> I ran a similar analysis on our cluster, using the same source files as
> the ones on dumps.wikimedia.org and Spark to speed up the computation.
> I got complete, consistent results for both of the examples you gave:
>
> Sum - count Data
>
> date     Climate_change --> Global_warming  Global_warming --> Climate_change  Total Result
> 2017-11                               3904                                950          4854
> 2017-12                               3549                                780          4329
> 2018-01                               4508                               1011          5519
> 2018-02                               3548                                998          4546
> 2018-03                               3462                                745          4207
> 2018-04                               3726                                755          4481
> 2018-05                               3730                                810          4540
> 2018-06                               2971                                862          3833
> 2018-07                               3500                               1602          5102
> 2018-08                               4546                               1644          6190
> 2018-09                               3962                               1472          5434
> 2018-10                               6155                               3048          9203
> 2018-11                               5865                               2617          8482
> 2018-12                               5491                               2227          7718
> 2019-01                               5774                               2911          8685
> 2019-02                               6311                               2845          9156
> 2019-03                               6858                               2514          9372
> 2019-04                               6824                               2199          9023
>
>
> Sum - count Data
>
> date          Air_pollution --> Smog  Smog --> Air_pollution  Total Result
> 2017-11                           82                     263           345
> 2017-12                          200                     184           384
> 2018-01                           65                     140           205
> 2018-02                           82                      98           180
> 2018-03                          418                     149           567
> 2018-04                          295                     137           432
> 2018-05                          215                      95           310
> 2018-06                          245                      85           330
> 2018-07                          233                      70           303
> 2018-08                           36                      62            98
> 2018-09                           45                      81           126
> 2018-10                           66                      96           162
> 2018-11                          128                     135           263
> 2018-12                           50                      90           140
> 2019-01                           68                      92           160
> 2019-02                           50                      68           118
> 2019-03                           49                      72           121
> 2019-04                           33                      51            84
> Total Result                    2360                    1968          4328
>
> Maybe there is an issue in the way you process the data?
> Best
> Joseph
>
>
>
>
> On Mon, May 13, 2019 at 3:38 PM Simon Munzert <
> simon.munz...@googlemail.com> wrote:
>
>> Hi all,
>>
>> I've got a question on the completeness of the clickstream dataset. I
>> downloaded the dumps for 2018 from
>> https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only).
>> When I filter for the article pair "Climate change" and "Global warming"
>> (either one being either prev or curr) for all of 2018, this is what I get:
>>
>>   prev           curr           type      n month
>>   <chr>          <chr>          <chr> <dbl> <chr>
>> 1 Global_warming Climate_change link    755 2018-04
>> 2 Global_warming Climate_change link    810 2018-05
>> 3 Climate_change Global_warming link   3730 2018-05
>> 4 Climate_change Global_warming link   3962 2018-09
>> 5 Climate_change Global_warming link   5865 2018-11
>> 6 Climate_change Global_warming link   5491 2018-12
>> 7 Global_warming Climate_change link   2227 2018-12
>>
>> The visit numbers seem plausible. But why is there no data on, e.g.,
>> January to March? And why is there data for both directions in May and
>> December, but not for the others? This seems implausible given the
>> popularity of the articles.
>>
>> Here's another example:
>>
>>   prev          curr          type      n month
>>   <chr>         <chr>         <chr> <dbl> <chr>
>> 1 Smog          Air_pollution link    140 2018-01
>> 2 Air_pollution Smog          link     82 2018-02
>> 3 Air_pollution Smog          link    295 2018-04
>> 4 Air_pollution Smog          link    215 2018-05
>> 5 Smog          Air_pollution link     85 2018-06
>> 6 Air_pollution Smog          link    233 2018-07
>> 7 Air_pollution Smog          link     45 2018-09
>> 8 Smog          Air_pollution link     96 2018-10
>> 9 Smog          Air_pollution link     90 2018-12
>>
>> Am I missing something here?
>>
>> Thanks in advance,
>> Simon
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
> --
> Joseph Allemandou (joal) (he / him)
> Sr Data Engineer
> Wikimedia Foundation
>


-- 
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
