Re: [Wiki-research-l] [Release]

2015-03-02 Thread Oliver Keyes
Indeed! Orienting it that way (pivoting on language rather than
project) is something several people have asked for; I plan to spend a
chunk of my spare time (that is, recreational time) trying to make it
work. Should be fairly trivial.
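
One possible reading of that re-orientation, sketched in Python with pandas. It assumes the column names used in the pandas snippet in this thread (`project`, `country`, `pageviews_percentage`). The three sample rows are only the figures quoted in this thread, and `by_country` is a hypothetical helper name. A caveat: the published percentages are shares of each project's traffic, so the pivoted table only places those shares side by side per country. It does not give the share of a country's traffic that goes to each language, which would need raw counts.

```python
import pandas as pd

# Illustrative subset only: the three figures quoted in this thread.
sample = pd.DataFrame({
    'project': ['da.wikipedia.org', 'da.wikipedia.org', 'sv.wikipedia.org'],
    'country': ['Denmark', 'Sweden', 'Denmark'],
    'pageviews_percentage': [61, 18, 2],
})


def by_country(df):
    """Re-orient the long table: one row per country, one column per project."""
    return df.pivot(index='country', columns='project',
                    values='pageviews_percentage')


wide = by_country(sample)
# Pairs absent from the input (here Sweden / sv.wikipedia.org) come out as NaN.
print(wide)
```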

On 2 March 2015 at 09:55, h  wrote:
> Hello Finn,
> I do not have a specific answer to your question. However, it might be
> worthwhile to add Finnish to the comparison. According to the CLDR 26
> territory-language (T-L) information
> http://www.unicode.org/cldr/charts/26/supplemental/territory_language_information.html
> there is a sizable population of Finnish speakers in Sweden:
>
> Swedish {O} sv 95.0% 99.0%
> Finnish {OR} fi 2.2%
>
> So if a similar query is run for Finnish and the results also show an
> "undue" proportion of visits from Sweden, then the anomaly you observed is
> not that unique. We probably need many iterations of comparative outcomes
> and normalization of the data (Sweden does have a larger population). It
> might also be handy to have some statistics on immigration or residence;
> this is the EU, after all. I would not be surprised if, for example, visits
> from Oxford to Wikipedia included a sizable share of German-language
> requests.
>
> I am still a bit bothered by the number "1" in the current dataset. It
> does not feel right, since 1.4% and 0.6% differ notably yet both appear
> as 1. Perhaps we need a higher-precision "universal percentage" for each
> territory-language pair. It would also be great to have another
> aggregation: given a territory, which language versions of Wikipedia are
> accessed.
>
> Best,
> han-teng liao
>
> 2015-03-02 13:54 GMT+01:00 Finn Årup Nielsen :
>>
>> Hi Oliver,
>>
>>
>> Interesting dataset! I am curious about why the Danish Wikipedia is so
>> highly accessed from Sweden. Could it be an error, e.g., with Telia
>> IP-numbers?
>>
>> In Python:
>>
>> >>> import pandas as pd
>> >>> df = pd.read_csv(
>> ...     'http://files.figshare.com/1923822/language_pageviews_per_country.tsv',
>> ...     sep='\t')
>> >>> # .loc replaces the .ix indexer, which pandas 1.0 and later no longer provide
>> >>> df.loc[df.project == 'da.wikipedia.org',
>> ...        ['country', 'pageviews_percentage']].set_index('country')
>>                 pageviews_percentage
>> country
>> Austria                            1
>> China                              1
>> Denmark                           61
>> Estonia                            1
>> France                             1
>> Germany                            2
>> Netherlands                        2
>> Norway                             1
>> Sweden                            18
>> United Kingdom                     3
>> United States                      3
>> Other                              5
>>
>>
>> MaxMind has some numbers on their own accuracy:
>>
>> https://www.maxmind.com/en/geoip2-city-database-accuracy
>>
>> For Denmark 85% is "Correctly Resolved", for Sweden only 68%. I wonder if
>> this really could bias the result so much.
>>
>> If the numbers are correct why would the Swedish read the Danish Wikipedia
>> so much? Bots? It does not apply the other way around: Only 2% of the
>> traffic to Swedish Wikipedia comes from Denmark.
>>
>>
>>
>> best regards
>> Finn
>>
>>
>>
>> On 02/25/2015 10:06 PM, Oliver Keyes wrote:
>>>
>>> Hey all!
>>>
>>> We've released a highly-aggregated dataset of readership data -
>>> specifically, data about where, geographically, traffic to each of our
>>> projects (and all of our projects) comes from. The data can be found
>>> at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've
>>> put together an exploration tool for it at
>>> https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/
>>>
>>> Hope it's useful to people!
>>>
>>
>>
>> --
>> Finn Årup Nielsen
>> http://people.compute.dtu.dk/faan/
>>
>>
>> ___
>> Wiki-research-l mailing list
>> Wiki-research-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation



Re: [Wiki-research-l] [Release]

2015-03-02 Thread h
Hello Finn,
I do not have a specific answer to your question. However, it might be
worthwhile to add Finnish to the comparison. According to the CLDR 26
territory-language (T-L) information
http://www.unicode.org/cldr/charts/26/supplemental/territory_language_information.html
there is a sizable population of Finnish speakers in Sweden:

Swedish {O} sv 95.0% 99.0%
Finnish {OR} fi 2.2%

So if a similar query is run for Finnish and the results also show an
"undue" proportion of visits from Sweden, then the anomaly you observed is
not that unique. We probably need many iterations of comparative outcomes
and normalization of the data (Sweden does have a larger population). It
might also be handy to have some statistics on immigration or residence;
this is the EU, after all. I would not be surprised if, for example, visits
from Oxford to Wikipedia included a sizable share of German-language
requests.

I am still a bit bothered by the number "1" in the current dataset. It
does not feel right, since 1.4% and 0.6% differ notably yet both appear
as 1. Perhaps we need a higher-precision "universal percentage" for each
territory-language pair. It would also be great to have another
aggregation: given a territory, which language versions of Wikipedia are
accessed.
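
To make the precision concern concrete, here is a small Python sketch. The two shares are the 1.4% and 0.6% figures from the paragraph above, used as hypothetical unrounded values. How the dataset actually rounds is an assumption here, not something confirmed in this thread.

```python
# Hypothetical unrounded shares, taken from the 1.4% / 0.6% example above.
shares = [1.4, 0.6]

# Whole-percent reporting collapses both values to the same number.
rounded = [round(s) for s in shares]
# One extra decimal place keeps them distinguishable.
one_decimal = [round(s, 1) for s in shares]

# The larger share is more than twice the smaller one.
ratio = shares[0] / shares[1]

print(rounded)      # [1, 1]
print(one_decimal)  # [1.4, 0.6]
```

With whole percentages, a share more than twice as large as another is reported identically, which is why a higher-precision figure per territory-language pair would help.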

Best,
han-teng liao

2015-03-02 13:54 GMT+01:00 Finn Årup Nielsen :

> Hi Oliver,
>
>
> Interesting dataset! I am curious about why the Danish Wikipedia is so
> highly accessed from Sweden. Could it be an error, e.g., with Telia
> IP-numbers?
>
> In Python:
>
> >>> import pandas as pd
> >>> df = pd.read_csv(
> ...     'http://files.figshare.com/1923822/language_pageviews_per_country.tsv',
> ...     sep='\t')
> >>> # .loc replaces the .ix indexer, which pandas 1.0 and later no longer provide
> >>> df.loc[df.project == 'da.wikipedia.org',
> ...        ['country', 'pageviews_percentage']].set_index('country')
>                 pageviews_percentage
> country
> Austria                            1
> China                              1
> Denmark                           61
> Estonia                            1
> France                             1
> Germany                            2
> Netherlands                        2
> Norway                             1
> Sweden                            18
> United Kingdom                     3
> United States                      3
> Other                              5
>
>
> MaxMind has some numbers on their own accuracy:
>
> https://www.maxmind.com/en/geoip2-city-database-accuracy
>
> For Denmark 85% is "Correctly Resolved", for Sweden only 68%. I wonder if
> this really could bias the result so much.
>
> If the numbers are correct why would the Swedish read the Danish Wikipedia
> so much? Bots? It does not apply the other way around: Only 2% of the
> traffic to Swedish Wikipedia comes from Denmark.
>
>
>
> best regards
> Finn
>
>
>
> On 02/25/2015 10:06 PM, Oliver Keyes wrote:
>
>> Hey all!
>>
>> We've released a highly-aggregated dataset of readership data -
>> specifically, data about where, geographically, traffic to each of our
>> projects (and all of our projects) comes from. The data can be found
>> at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've
>> put together an exploration tool for it at
>> https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/
>>
>> Hope it's useful to people!
>>
>>
>
> --
> Finn Årup Nielsen
> http://people.compute.dtu.dk/faan/
>


Re: [Wiki-research-l] [Release]

2015-03-02 Thread Finn Årup Nielsen

Hi Oliver,


Interesting dataset! I am curious about why the Danish Wikipedia is so 
highly accessed from Sweden. Could it be an error, e.g., with Telia
IP-numbers?


In Python:

>>> import pandas as pd
>>> df = pd.read_csv(
...     'http://files.figshare.com/1923822/language_pageviews_per_country.tsv',
...     sep='\t')
>>> # .loc replaces the .ix indexer, which pandas 1.0 and later no longer provide
>>> df.loc[df.project == 'da.wikipedia.org',
...        ['country', 'pageviews_percentage']].set_index('country')
                pageviews_percentage
country
Austria                            1
China                              1
Denmark                           61
Estonia                            1
France                             1
Germany                            2
Netherlands                        2
Norway                             1
Sweden                            18
United Kingdom                     3
United States                      3
Other                              5


MaxMind has some numbers on their own accuracy:

https://www.maxmind.com/en/geoip2-city-database-accuracy

For Denmark 85% is "Correctly Resolved", for Sweden only 68%. I wonder 
if this really could bias the result so much.


If the numbers are correct why would the Swedish read the Danish 
Wikipedia so much? Bots? It does not apply the other way around: Only 2% 
of the traffic to Swedish Wikipedia comes from Denmark.
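
The asymmetry described above can be checked with a small helper. This is a minimal sketch in Python with pandas, assuming the column names from the snippet above. The `share` function is a hypothetical name, and the sample rows contain only the figures quoted in this message rather than the full dataset.

```python
import pandas as pd

# Illustrative subset only: the figures quoted in this message.
sample = pd.DataFrame({
    'project': ['da.wikipedia.org', 'da.wikipedia.org', 'sv.wikipedia.org'],
    'country': ['Denmark', 'Sweden', 'Denmark'],
    'pageviews_percentage': [61, 18, 2],
})


def share(df, project, country):
    """Return the percentage of `project`'s pageviews that come from `country`."""
    rows = df[(df.project == project) & (df.country == country)]
    if rows.empty:
        return None  # pair not listed, possibly folded into 'Other'
    return int(rows['pageviews_percentage'].iloc[0])


da_from_sweden = share(sample, 'da.wikipedia.org', 'Sweden')
sv_from_denmark = share(sample, 'sv.wikipedia.org', 'Denmark')
print(da_from_sweden, sv_from_denmark)  # 18 2
```

Run against the full file, the same helper could be applied to other neighbouring pairs to see whether the Danish/Swedish case stands out.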




best regards
Finn



On 02/25/2015 10:06 PM, Oliver Keyes wrote:

> Hey all!
>
> We've released a highly-aggregated dataset of readership data -
> specifically, data about where, geographically, traffic to each of our
> projects (and all of our projects) comes from. The data can be found
> at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've
> put together an exploration tool for it at
> https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/
>
> Hope it's useful to people!




--
Finn Årup Nielsen
http://people.compute.dtu.dk/faan/
