Re: [Analytics] most popular articles per country

2018-07-09 Thread Nuria Ruiz
Amir:

FYI, this data has a couple of caveats:

1) the "-" stands for pageviews of pages for which we cannot extract a title.

2) the data is very much affected by bot spikes (you can mitigate that by
filtering by agent_type="user", but a significant portion of bot
traffic is still not labeled as such).
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16

3) there are privacy considerations when the number of views is small:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews/Pageviews_by_country#Is_Pageviews_by_country_privacy_sensitive
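A minimal sketch of how caveats 1 and 2 might be applied once rows have been pulled out of pageview_hourly. The `Row` shape, the helper name `usable_rows`, and the "spider" label in the sample data are assumptions made for illustration; only agent_type, page_title and view_count are named in this thread. Filtering on agent_type reduces bot traffic but does not remove the unlabeled part.

```python
from typing import Iterable, List, NamedTuple


class Row(NamedTuple):
    # Hypothetical in-memory shape of one pageview_hourly record.
    country: str
    page_title: str
    agent_type: str
    view_count: int


def usable_rows(rows: Iterable[Row]) -> List[Row]:
    """Drop rows with no extractable title ("-") and rows not labeled "user"."""
    return [
        r for r in rows
        if r.page_title != "-" and r.agent_type == "user"
    ]


# Made-up sample data, not real pageview figures.
sample = [
    Row("Germany", "-", "user", 500),
    Row("Germany", "Berlin", "user", 120),
    Row("Germany", "Berlin", "spider", 900),
]
kept = usable_rows(sample)
```

In a query, the same two conditions would normally go in the WHERE clause so that they are applied before the views are summed.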


> Is anything like this already published anywhere? If it isn't, it may be
> nice to publish such a thing, similarly to Google Zeitgeist.

We do not have immediate plans to do so due to privacy considerations. That
said, Dario's team has a project in this area that might produce datasets to
be published this year:
https://meta.wikimedia.org/wiki/Research:Quantifying_the_global_attention_to_public_health_threats_through_Wikipedia_pageview_data

See also:
https://phabricator.wikimedia.org/T189339

Thanks,

Nuria


Re: [Analytics] most popular articles per country

2018-07-09 Thread Amir E. Aharoni
Thanks. Another question: For some countries, the result is "-", for
example Germany:

Germany    -    en.wikipedia    1275634

Any idea why?

(I modified the query a bit and added the "project" column. And yes, the
fact that en.wikipedia is at the top in Germany is also quite odd.)


--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬


Re: [Analytics] most popular articles per country

2018-07-09 Thread Francisco Dans
I think as long as you put in a filter so that the minimum number of
pageviews is maybe 1000, you should be fine privacy-wise. I can't speak too
much to your second question.
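A rough sketch of the minimum-pageviews filter, written against made-up rows rather than the real table. The function name `views_above_threshold` and the sample figures are hypothetical, and 1000 is only the tentative value mentioned above, not an official privacy threshold.

```python
from collections import Counter

MIN_VIEWS = 1000  # tentative value from the message above; adjust as needed


def views_above_threshold(rows, min_views=MIN_VIEWS):
    """Sum views per (country, page_title) and keep totals >= min_views."""
    totals = Counter()
    for country, page_title, view_count in rows:
        totals[(country, page_title)] += view_count
    return {key: total for key, total in totals.items() if total >= min_views}


# Made-up hourly counts, not real data.
hourly = [
    ("Spain", "Madrid", 700),
    ("Spain", "Madrid", 450),
    ("Spain", "Toledo", 300),
]
safe = views_above_threshold(hourly)
```

The threshold is applied to the summed views, not to individual hourly rows, which is what a HAVING clause on the aggregated count would do in SQL.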



-- 
*Francisco Dans*
Software Engineer, Analytics Team
Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] most popular articles per country

2018-07-09 Thread Amir E. Aharoni
Thank you so much! In many countries it's

A couple of questions:
1. Are any of the results of this query private? Or can I talk about them
to people?
2. Is anything like this already published anywhere? If it isn't, it may be
nice to publish such a thing, similarly to Google Zeitgeist.


--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬



Re: [Analytics] most popular articles per country

2018-07-09 Thread Francisco Dans
Hi Amir,

As Tilman has suggested, your best bet is to query the pageview_hourly
table. I was going to be lazy and give you a query to just find out the
most viewed article for a given country, but then I made a few experiments
and this is the query I came up with to generate a list of countries and
their respective most viewed articles and view counts. It takes a few
minutes to run for a single day, so I'm sure someone here could suggest a
better approach.

WITH articles_countries AS (
    SELECT country, page_title, sum(view_count) AS views
    FROM pageview_hourly
    WHERE year=2018 AND month=3 AND day=15
    GROUP BY country, page_title
)
SELECT s.country AS country, s.page_title AS page_title, s.views AS views
FROM (
    SELECT max(named_struct('views', views, 'country', country,
                            'page_title', page_title)) AS s
    FROM articles_countries
    GROUP BY country
) t;
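The query above appears to rely on max() over a struct comparing its fields in order, so putting views first selects, for each country, the row with the most views. As an illustration only (this is not Hive code), Python tuples compare the same way; the function name and the sample figures below are made up.

```python
def top_article_per_country(totals):
    """Map each country to its (page_title, views) pair with the most views.

    totals: iterable of (country, page_title, views), already summed.
    """
    best = {}
    for country, page_title, views in totals:
        # views comes first, so it decides the comparison; the title only
        # breaks ties, just as the later struct fields would.
        candidate = (views, page_title)
        if country not in best or candidate > best[country]:
            best[country] = candidate
    return {c: (title, views) for c, (views, title) in best.items()}


# Made-up totals, not real pageview figures.
totals = [
    ("Germany", "Berlin", 1200),
    ("Germany", "Hamburg", 3400),
    ("Israel", "Haifa", 2100),
]
top = top_article_per_country(totals)
```

One pass with a running maximum avoids sorting every country's articles, which is the same reason the struct trick is cheaper than ranking all rows.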


Cheers / see you in ZA,
Fran




-- 
*Francisco Dans*
Software Engineer, Analytics Team
Wikimedia Foundation


Re: [Analytics] most popular articles per country

2018-07-09 Thread Tilman Bayer
One can use the pageview_hourly table for this.

On Mon, Jul 9, 2018 at 1:18 AM, Amir E. Aharoni <
amir.ahar...@mail.huji.ac.il> wrote:

> Hi,
>
> Is there a way to find what are the most popular articles per country?
>
> Finding the most popular articles per language is easy with the Pageviews
> tool, but languages and countries are of course not the same.
>
> One thing I tried is going to Turnilo, webrequest_sampled_128, and
> filtering by country. But here it gets troublesome:
> * Splitting can be done by Uri host, which is *more or less* the project,
> or by Uri path, which is *more or less* the article (but see below), and I
> couldn't find a convenient way to combine them.
> * Mobile (.m.) and desktop hosts are separate. It may actually sometimes
> be useful to see differences (or lack thereof) between desktop and mobile,
> but combining them is often useful, too. This can probably be done with
> regular expressions, but this brings us to the biggest problem:
> * Filtering by Uri path would be useful if it didn't have so many paths
> for images, beacons, etc. Filtering using the regular expression
> "\/wiki\/.+" may be the right thing functionally, but in practice it's very
> slow or doesn't work at all.
> * I don't know what exactly is logged in webrequest_sampled_128, but the
> name hints that it doesn't include everything. A sample may be OK for
> countries with a lot of traffic like U.S. or Spain, but for countries with
> smaller traffic this may start being a problem.
>
> Any better ideas?
>
> Thanks!
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> ‪“We're living in pieces,
> I want to live in peace.” – T. Moore‬
>
>
>
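For anyone who still wants to work from webrequest-style data, the host and path handling described in the quoted message could be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, the mobile host is assumed to differ only by an ".m." label, and the factor of 128 is inferred from the dataset name rather than confirmed in this thread.

```python
import re

MOBILE_LABEL = re.compile(r"\.m\.")      # e.g. en.m.wikipedia.org
ARTICLE_PATH = re.compile(r"^/wiki/.+")  # same idea as "\/wiki\/.+" above
SAMPLING_FACTOR = 128                    # assumption: 1-in-128 sample


def normalize_host(uri_host: str) -> str:
    """Fold a mobile host into its desktop equivalent."""
    return MOBILE_LABEL.sub(".", uri_host, count=1)


def is_article_path(uri_path: str) -> bool:
    """True for /wiki/<something>; excludes images, beacons and similar paths."""
    return ARTICLE_PATH.match(uri_path) is not None


def estimate_total(sampled_hits: int) -> int:
    """Scale a sampled count up to an estimate of the full traffic."""
    return sampled_hits * SAMPLING_FACTOR
```

For a low-traffic country, a handful of sampled hits scaled this way gives a very noisy estimate, which is the concern raised in the last bullet of the quoted message.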


-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB