[Analytics] Re: best programme to work with data

2023-01-26 Thread Federico Leva (Nemo)

On 26/01/23 18:04, Robert Garrigos wrote:
With some 1.5 million rows, I cannot open it with Numbers or LibreOffice
to sum column 4.

Which tools do you use to work with such big files?

To sum a column in a CSV I would use visidata:

I think I've used it with CSVs in the order of 10^7 rows, possibly 10^8. 
(Usually to make pivot tables.)
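
For those who prefer a script to an interactive tool, a column can also be summed row by row, without loading the whole file into memory. This is only a minimal Python sketch and not what visidata does internally: the name sum_column, the header row and the comma delimiter are assumptions about the file, which the thread does not describe.

```python
import csv
from decimal import Decimal, InvalidOperation


def sum_column(path, column=4, has_header=True):
    """Sum one column (1-based index) of a CSV file, reading row by row."""
    total = Decimal(0)
    skipped = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # discard the (assumed) header row
        for row in reader:
            try:
                total += Decimal(row[column - 1])
            except (IndexError, InvalidOperation):
                skipped += 1  # short row or non-numeric cell
    return total, skipped
```

The second return value counts the rows that could not be added, which is worth checking before trusting the total.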

Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-le...@lists.wikimedia.org

[Analytics] Re: Pageviews per country

2022-12-21 Thread Federico Leva (Nemo)

On 21/12/22 19:55, Ismael Olea wrote:

About the rationale, one of the bigger drivers nowadays is the well known
link between heritage, tourism and sustainability (example: the Sustainable
Development Goals), so there is a trend to better analyze this context to
study and plan. Usually touristic destinations have very well defined
countries of origin.

That's also an example where correlation between language and country of 
origin has been useful! I'd love to see a replication of Hinnosaar's 
"Wikipedia matters" but with GLAM contributions and across different 



[Analytics] Re: Pageviews per country

2022-12-21 Thread Federico Leva (Nemo)

On 20/12/22 20:03, Ismael Olea wrote:

We are working with a heritage institution in a GLAM project and they are
interested in access statistics for the resources they have released in
Wikimedia. I think I understand the pageviews concept and how to use it
but, as far as I can tell, it's not possible to get details such as
article pageviews per country.

Depending on what you're interested in, it might be a sufficiently good 
approximation to look at usage by language.

The short case study about BEIC https://doi.org/10.4403/jlis.it-12481 
can give some ideas for statistics to track. See also

* https://commons.wikimedia.org/wiki/Commons:BEIC (brief overview)
* https://it.wikipedia.org/wiki/Progetto:GLAM/BEIC/2015-07#Sommario 
(analysis in Italian)

If you know the totals for the downloads from Commons, and you get some 
idea of the distribution by looking at the usage by language or other 
sources, that might be enough. There's always a certain level of 
uncertainty, so the exact absolute numbers are rarely that telling. BEIC 
for example was interested in the (order of magnitude of the) totals and 
it was useful to know the approximate share of traffic/interest from 
outside Italy (was it 1, 10 or 99 %?) and how much was due to ongoing 
"external" interest.

Note that in the mediacounts you can get additional hints by checking 
the share of requests coming from typical visits on Wikipedia (default 
thumbnail sizes), visits on Commons or downloads (raw files) and hotlinks.


Re: [Analytics] Growth in reader engagement since 2016?

2021-04-29 Thread Federico Leva (Nemo)

Thanks Kate for the update!

On 30/04/21 02:08, Kate Zimmerman wrote:

[...] active editors increased 18 percent, not 36 percent[3,5]. [...]
[3] December 2020 content interactions and active editor data from
[5] December 2015 active editor data from an internal query across all
Wikimedia sites, showing 79,420 active editors

What kind of internal query? You can't compare active editor numbers 
calculated with different methods. For instance, you need to have the same:

* content pages selection,
* user activity thresholds,
* global users aggregation/deduplication,
* exclusion of bots,
* seasonality corrections,
* and crucially, distance from the observed period.

December 2020 data is less than 4 months old, so the active editor 
figures for the month will keep decreasing for a while until deletion 
activity is mostly done for the period. If you want to compare it to 
December 2015 data, you'd need to use the same method and run it against 
a snapshot of the data taken in April 2016.

Best regards,


Re: [Analytics] Growth in reader engagement since 2016?

2021-04-03 Thread Federico Leva (Nemo)

On 17/03/21 10:53, Tilman Bayer wrote:

Are the underlying numbers published somewhere?

Hanlon's razor suggests looking at the most stupid explanation available 
for this number. The easiest piece of statistics currently available to 
someone who's looking for one in a hurry is the "unique devices" bit for 
"all Wikipedias":

This metric is widely misunderstood, but it's used nevertheless. It's 
even incorrectly quoted in the second paragraph of [[Wikipedia]] to 
support a figure on "unique visitors" (of which it says nothing).

It might be a coincidence, but the latest reading on that metric is 130 
% of the first: it went from 1460 G in 2017-04 to 1910 G in 2021-03. 
Most of this change is presumably the increase in the number of devices 
available to people, but the trend might be meaningful when narrowed 
down to shorter periods and/or geographies/languages where the number of 
devices per household has remained relatively constant in this period.
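
For the record, the arithmetic behind that percentage is a plain ratio of the two readings quoted above. A tiny Python sketch (the function name growth is just for illustration):

```python
def growth(first, latest):
    """Return (latest as a % of first, % change from first to latest)."""
    share = latest / first * 100
    return share, share - 100


# The two readings quoted above, in the same unit.
share, change = growth(1460, 1910)
print(f"{share:.0f} % of the first reading, i.e. {change:+.0f} %")
```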



Re: [Analytics] Wikisource pageviews by agent and method

2020-06-15 Thread Federico Leva (Nemo)

Dan Andreescu, 15/06/20 16:37:

Nemo, would this, our next-up priority for Wikistats, help? Basically, it
would let you filter on two different dimensions, so you can look at just
user desktop or spider mobile, etc.

Maybe. I don't have a pressing need for this breakdown, I'm just trying 
to make sure I understand.

More "interesting" might be the question of what the default should be
on the main page, but that's more a "political" decision or a matter of
tradition. I don't
consider the current choice particularly misleading because it takes 
only two clicks to find out that the totals include various kinds of 

On Mon, Jun 15, 2020 at 7:41 AM Francisco Dans wrote:

That's correct.

Alright. I just naively assumed that most bots would be classified as 
"desktop", given there's practically never a good reason to crawl the 
mobile domain, so I was surprised by the numbers. Either there's a lot 
of "bot" activity on the mobile domain, or there's very little "user" 
activity on desktop.

Already with the breakdown available, one can tell that it's better to 
be very careful about using such pageviews numbers about this project. 
I'm probably going to only mention unique devices in public, or the 
order of magnitude of the pageviews without further specifications.



[Analytics] Wikisource pageviews by agent and method

2020-06-15 Thread Federico Leva (Nemo)
The pageviews statistics for the Italian Wikisource are very confusing 
to me:

In May there were supposedly more than 5 million pageviews, of which 3M 
desktop + 2M mobile and 3M "user" + 2M "spider". Do the "spider" 
pageviews include both the desktop and mobile URLs?



[Analytics] 43k monthly active editors on the English Wikipedia

2020-05-23 Thread Federico Leva (Nemo)
The English Wikipedia is showing a pattern that I don't notice on 
several other wikis. If I'm not mistaken, in April 2020 monthly active 
editors passed 43k for the first time since 2011 (the year when 
MobileFrontend was created).

(As usual there will be a deflation of the number in a few months, after 
the deletions have run their course. The 43k threshold may still hold.)

The April peak looks like it continued and reinforced one of the 
now-usual October/January/March peaks. Do we know how much of this 
growth is organic or across the board and how much is amplification of 
existing known seasonal patterns (WikiEdu, perhaps)?


Re: [Analytics] Effects on Wikimedia web traffic trends from sites that reuse Wikimedia content and/or trademarks

2019-07-30 Thread Federico Leva (Nemo)
Because there are hundreds of mirrors and new ones are born or die about 
every week, it's probably worth mentioning we have some lists.




Re: [Analytics] Wikistats2 Better maps and new metric: Legacy Pageviews (a.k.a Pagecounts)

2018-07-11 Thread Federico Leva (Nemo)

The visualisation is very clear.

Nuria Ruiz, 11/07/2018 23:29:
Also, we have included legacy pageviews in the UI, we used to call these 
pagecounts and prior to June 2015 this is the metric that we reported as 
pageviews for all wikimedia sites.

Good to have historical data too.



Re: [Analytics] temporary drop in pageviews to ig.wikipedia

2018-06-21 Thread Federico Leva (Nemo)
I see the baseline is less than 200k monthly unique devices and there 
were no huge drops:


Absent trivial errors, such misclassifications of entire countries have 
been caused in the past by ISP changes. (I think the biggest case was 
Australia at some point, it was in Erik Zachte's blog.) Things like 
moving large IP ranges or slices of traffic from one operator to 
another. If I remember correctly, some big internet exchanges have been 
built recently in Central Africa, which may have prompted changes.

I'm not sure at what rate the GeoIP information is updated.



Re: [Analytics] pageviews before 2015

2018-06-12 Thread Federico Leva (Nemo)

Saqib Q, 12/06/2018 13:08:
OK, but how do I do it? Do I need to install some application to extract
the page view data for particular pages?

Just using grep should suffice, to produce a CSV you can open with 
LibreOffice or any spreadsheet software. But yes, you need some basic 
command line skills for that dataset.
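
As an illustration of the grep approach in script form, here is a rough Python sketch. It assumes gzip-compressed hourly files whose lines are space-separated as "project title count bytes"; that layout, the function name extract_counts and the file names in the usage note are assumptions to be checked against the dataset documentation.

```python
import csv
import gzip


def extract_counts(dump_paths, project, titles, out_path):
    """Copy matching rows from hourly dump files into one CSV file.

    Returns the number of data rows written.
    """
    wanted = set(titles)
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "project", "title", "views"])
        for path in dump_paths:
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
                for line in dump:
                    fields = line.split(" ")
                    if len(fields) < 3:
                        continue  # not in the assumed layout
                    if fields[0] == project and fields[1] in wanted:
                        writer.writerow([path, fields[0], fields[1], fields[2].strip()])
                        written += 1
    return written
```

Usage would be something like extract_counts(["hour1.gz", "hour2.gz"], "en", ["Some_Title"], "bios.csv"), with hypothetical file names; the resulting CSV opens in LibreOffice as described above.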



Re: [Analytics] pageviews before 2015

2018-06-12 Thread Federico Leva (Nemo)

Saqib Q, 11/06/2018 22:08:
I need to get page views of some bios from 2013 to 2015. Can anyone help
me? The current page view stats are not of any help.

Have you checked 



Re: [Analytics] Question about the "Page Views" tool

2018-03-07 Thread Federico Leva (Nemo)

Reception123 ., 06/03/2018 08:25:
I was wondering how one could install and use the "Page Views" tool that 
Wikimedia uses, on a non-WMF wiki.

I guess you could rebuild the entire cache and analytics clusters from 
puppet (supposedly documented somewhere around 
), or write something 
from scratch that would expose data with the same API format.



Re: [Analytics] A new landing page for the Wikimedia Research team

2018-02-07 Thread Federico Leva (Nemo)

Will it be translatable with standard tools?



Re: [Analytics] Tool to visualize which wiki pages link to which wiki pages?

2017-11-21 Thread Federico Leva (Nemo)

Andre Klapper, 21/11/2017 17:15:

I've been wondering if anyone's aware of any visualization tool that
draws a graph showing which wiki pages are linked from which other wiki
pages (up to a certain depth)

The closest thing I can think of is Erik's chart of category links, 
generated with a script which is published somewhere and could be 
adapted at least for simple regex filters.


There's also 
a graph of links between user pages, which was made perhaps in 2014.



Re: [Analytics] Undocumented project code in pagecounts-ez

2017-11-14 Thread Federico Leva (Nemo)

Michael Baldwin, 14/11/2017 04:43:
However, I've been coming across a large number of wiki codes "en.m". 
The "m" code is undocumented. It appears to be the mobile version of 
Wikipedia, but can anyone confirm that? Should the page be updated with 
this information?

Historically we collect most docs here:



Re: [Analytics] Google Code-in: Get your tasks for young contributors prepared!

2017-10-17 Thread Federico Leva (Nemo)

Lars Noodén, 17/10/2017 17:13:

Ok.  Is there a checklist of things to do that I may work on that task

In theory 
which was linked from 



Re: [Analytics] Google Code-in: Get your tasks for young contributors prepared!

2017-10-17 Thread Federico Leva (Nemo)

Lars Noodén, 17/10/2017 16:16:

Would it be possible to add T144714, or something based on it, to the list?


No, because it would require the minor to sign an NDA.



Re: [Analytics] Drop in mainpage pageviews?

2017-07-15 Thread Federico Leva (Nemo)

Strainu, 15/07/2017 12:46:

Starting from an unrelated discussion on meta, I noticed a significant
drop in main page views for several wikis starting from April this
year. Is there anything we (or Google) did at that time to justify
this drop?

Curiously, if you sum all languages the numbers are much more stable, 
with some periods getting noticeably higher or lower (July-August 2016 
was a known bug IIRC; January-March up and March-May 2017 down I don't 



Re: [Analytics] Fwd: follow-up on editors

2017-03-22 Thread Federico Leva (Nemo)

Aaron Halfaker, 22/03/2017 22:43:

· Number of editors who contribute 1 edit per month?

First column of 
https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm .

· Is it possible/feasible to run editor retention metrics
globally (versus just based on a single project)?

This depends on whether one just wants to (deduplicate and) sum different 
projects, or also consider interwiki events (such as a person stopping 
activity on a wiki but resuming activity on another wiki). I remember 
something was done a few years ago to see if Wikidata removed active 
editors from other projects and a few "migration" paths were identified 
in all directions. I can't find the chart/table now though.

· Total number of editors on all projects over the past 16
years (not just ENWP)?

For a quick estimate I usually make a proportion 
: https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm ~ 
https://stats.wikimedia.org/EN/TablesWikipediaEN.htm#editdistribution : 
x . If the active editors in most classes are about 1 : 2, then probably
the total number of editors in all Wikipedias + all other projects is
more than twice the English Wikipedia's total, e.g. over 10 million (or
over 2 million if you consider the usual 10 edits threshold).

· Global distribution of editors by region (or country),




Re: [Analytics] Top editors in a certain namespace across sites?

2017-03-22 Thread Federico Leva (Nemo)

(I confirm my advice. I usually use Labs of course.)



Re: [Analytics] Top editors in a certain namespace across sites?

2017-03-22 Thread Federico Leva (Nemo)

Andre Klapper, 22/03/2017 13:51:

Does anyone know of a way to look up the top editors for a certain
namespace (like "Module") across all Wikimedia sites?

The easiest way is usually to run the relevant SELECT queries with a 
small bash script on Labs or with sql.php on tin (e.g. 
https://phabricator.wikimedia.org/T128326#3100126 ).



Re: [Analytics] Strange results from Wikistats

2017-02-26 Thread Federico Leva (Nemo)

Neil Patel Quinn, 22/02/2017 03:21:

Any idea what's going on?

See https://phabricator.wikimedia.org/T158500



Re: [Analytics] Glamorous & Massview report?

2017-02-26 Thread Federico Leva (Nemo)

Itzik - Wikimedia Israel, 26/02/2017 17:56:

Hmm... maybe an easier way for someone who doesn't want to play with code
and download dumps? :)

Does a standard command like grep qualify as easier? :-) On a computer 
with some bandwidth I did something like:

wget -r -np -nH -nd -A bz2 
; find -name "mediacounts*bz2" -print0 | xargs -0 -P8 -I§ -n1 bzgrep 
webm § | grep -E '/Channel_?2.+webm' > 2016-12-channel2.csv

Which gives me about 650k accesses during December 2016, of which 10500
were downloads of the complete file and 8k were streamed plays.
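
The same filter can be sketched in Python for those who find that easier than xargs. This is an unoptimised, single-process approximation of the pipeline above (no -P8 parallelism); matching_lines is a made-up name and the files are assumed to be already downloaded into one directory.

```python
import bz2
import re
from pathlib import Path

# Same expression as in the grep -E call above.
PATTERN = re.compile(r"/Channel_?2.+webm")


def matching_lines(directory, pattern=PATTERN):
    """Yield every line of the mediacounts*bz2 files under directory that matches."""
    for path in sorted(Path(directory).rglob("mediacounts*bz2")):
        with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if pattern.search(line):
                    yield line.rstrip("\n")
```

Writing the result to a file is then a matter of looping over matching_lines(".") and printing each line.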



Re: [Analytics] Glamorous & Massview report?

2017-02-26 Thread Federico Leva (Nemo)

Itzik - Wikimedia Israel, 26/02/2017 16:13:

file from a specific commons category ("Wikimedia Israel - Channel 2




[Analytics] Missing mediacounts for 2016-12-01

2017-02-16 Thread Federico Leva (Nemo)

As far as I can see, mediacounts.2016-12-01.v00.tsv.bz2 is missing:




Re: [Analytics] stats.grok.se used in study about Snowden and internet traffic

2017-01-19 Thread Federico Leva (Nemo)

Dan Andreescu, 19/01/2017 23:42:

there are no ways to know if data is missing or there are actual gaps.

These were documented on the FAQ though, also based on Erik Zachte's 



Re: [Analytics] stats.grok.se used in study about Snowden and internet traffic

2017-01-19 Thread Federico Leva (Nemo)

Dan Andreescu, 19/01/2017 20:09:

now that stats.grok is completely down.

It's not, AFAICT: http://stats.grok.se/en/200712/Britney_Spears
Only the new data is missing (since January 2016), as stated on the FAQ 



Re: [Analytics] Unusually high traffic to multiple language Wikipedias from France in October 2016?

2017-01-08 Thread Federico Leva (Nemo)

Vipul Naik, 08/01/2017 08:13:

It looks like a bot that happened to run in France but didn't get
classified by the existing algorithms as a bot.

Seems most likely indeed, especially since France appears as first 
country for nearly all wikis from Esperanto and below (<= 0.01% share of 
global total) at 
, closely followed by USA.

France is also 97 % of traffic for bm.wiki, which has a -87% that month, 
while we have no breakdown for kg, rn, tpi, ee, lg, cr; bg (+124 %) and 
wuu (+81 %) seem different patterns. (For convenience, I attach the 
TablesPageViewsMonthlyOriginal.htm row for the month sorted by variation.)

Does anybody have any other ideas about what might have happened here?

It can happen that some large IP ranges are reassigned to a different 
ISP, or otherwise get used in different countries, so that a significant 
portion of traffic seems to "move" to other countries: for instance we 
had this problem with Australia once. I'm not sure any ISP could have
real users for so many languages: 
perhaps a huge mobile operator (Orange?) or some hosting suddenly used 
for a lot of proxying (OVH?), but I wouldn't bet on it.



Re: [Analytics] On Wikipedia edits archive per country.

2017-01-02 Thread Federico Leva (Nemo)
Probably someone at WMF would be more appropriate for a call; I can only 
share information which is online.

Rafael Escalona Reynoso, 02/01/2017 17:40:

Thank you for your reply. While reviewing both tables, the process to define
the total number of edits per country is still a bit ambiguous. For example,
in the case of Spanish, edits can be allocated to various countries.

Sorry, I should have been clearer. What you need is the language 
breakdown, like this (which was last calculated in 2013, it seems):


If you only consider the numbers above 3 % or so, these tend to be 
rather stable, so the 2013 data is still a good approximation.

addition not all quarters for 2015 are available.

Would it be possible to obtain instead the overall number of edits per
country for the year 2015?

Ah sorry, I thought you wanted historical data more than recent data. 
Updates have been stalled by WMF while they work on a potential 
replacement, so I suspect you can only adapt older data or wait.



Re: [Analytics] On Wikipedia edits archive per country.

2017-01-02 Thread Federico Leva (Nemo)

Again, we are exclusively looking for the absolute number of Wikipedia
updates per year per country.

https://stats.wikimedia.org/wikimedia/squids/ has data on 23 quarters. 
To get the absolute number, you can multiply the percentage by the 
totals at https://stats.wikimedia.org/EN/TablesDatabaseEdits.htm .

These are HTML tables, but nothing LibreOffice Calc can't digest.
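
The multiplication described above is trivial, but for the record here is a sketch in Python. The function name is arbitrary and the figures in the usage lines are invented placeholders, not values taken from either table.

```python
def absolute_edits(total_edits, share_percent):
    """Convert a country's percentage share into an absolute number of edits."""
    return total_edits * share_percent / 100


# Placeholder figures only: replace them with a total from the edits table
# and a percentage from the per-country report for the same period.
example_total = 1000
example_share = 2.5
print(absolute_edits(example_total, example_share))
```

Since the percentages are rounded in the reports, the result is an estimate and small shares will carry a large relative error.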



Re: [Analytics] Does Analytics do site traffic and SEO measurement kinds of things?

2016-12-23 Thread Federico Leva (Nemo)
The scarce traffic on Wikivoyage is not especially surprising: there is 
a lot of competition in this niche and Wikivoyage is not particularly 
different in content etc. from some of its competitors.

The evidence accumulated in the last few years suggests that we should 
accept that Wikivoyage is one of those Wikimedia projects in the lower 
end of traffic and with little hope for growth (like Wikiversity and 
Wikinews in most languages). They are still useful for very specific 
qualitative goals (such as collaboration with institutions which could 
not have happened on other wikis) and should be assessed against those.



Re: [Analytics] Page view statistics for Wikimedia projects - time series resolution

2016-12-21 Thread Federico Leva (Nemo)

Laurentiu Checiu, 21/12/2016 18:46:

Would it be possible to get the above-mentioned time series at
millisecond (ms) resolution?

Definitely not with (current) public data, but if you use the currently 
maintained pageviews data you can have hourly resolution:




Re: [Analytics] [Reminder] eventlogging mysql/analytics stores maintenance

2016-12-09 Thread Federico Leva (Nemo)

This will happen today

For the archives' sake, I believe "this" stands for what was announced in 



Re: [Analytics] ensuring reader anonymity

2016-11-11 Thread Federico Leva (Nemo)

Dan Andreescu, 10/11/2016 16:00:

I don't have as clear a reason for why we store the plain IP in
webrequest.  I think we could count uniques and all that other stuff
with the IP hash.  It's a good question, tentative +1 unless I'm
forgetting something.

I support any decrease of the storage of plain IP addresses. See also 
for more references.



Re: [Analytics] Identifying bots and bot edit decline

2016-10-11 Thread Federico Leva (Nemo)
Wikistats knows about 8017 bot usernames according to 
(cut -f2 -d, StatisticsBots.csv | sort -u | wc -l ). Given active 
editors tend to complain a lot if they get counted as bots, a 
comprehensive list should probably be a superset of that one.
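
For anyone without cut and sort at hand, the count above can be approximated in Python. This is a sketch that mimics the shell pipeline (splitting on every comma, without CSV quoting rules); the function name is invented, and StatisticsBots.csv is assumed to be a local copy of the file.

```python
def count_unique_second_field(path):
    """Count distinct values of the second comma-separated field, like
    cut -f2 -d, FILE | sort -u | wc -l
    """
    names = set()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(",")
            # cut prints the whole line when it contains no delimiter.
            names.add(fields[1] if len(fields) > 1 else fields[0])
    return len(names)
```

Called as count_unique_second_field("StatisticsBots.csv"), it should give the same number as the shell command on the same file.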

Flöck, Fabian, 11/10/2016 11:15:

This is likely not news, so can someone enlighten me regarding what brought 
about that sharp decline of bot edits?

The migration of interwiki links to Wikidata, which is very visible in 
https://stats.wikimedia.org/EN/PlotsPngEditHistoryTop.htm .

There was also some statistic by WMF on whether active users had 
"migrated" to Wikidata from other projects, but I can't quickly find it 
now; maybe it was around the time of 



Re: [Analytics] Where is the 2010 survey?

2016-09-27 Thread Federico Leva (Nemo)

Reem Al-Kashif, 27/09/2016 19:46:

I was wondering where the famous (or rather infamous) 2010 survey is?
This one was made by the WMF and showed that women made up less than 13% of
WP contributors (mentioned here

The category at the end of that page leads to 



Re: [Analytics] Parsing user agents in EventLogging data

2016-09-14 Thread Federico Leva (Nemo)

Tilman Bayer, 15/09/2016 01:21:

This came up recently with the Reading web team, for the purpose of
investigating whether certain issues are caused by certain browsers
only. But I imagine it has arisen in other places as well.




Re: [Analytics] Do most of the articles really receive little to no edits?

2016-09-07 Thread Federico Leva (Nemo)

Reem Al-Kashif, 07/09/2016 15:52:

I always hear people saying that most of the articles usually receive
little to no edits

Do you mean that many articles
* have not been edited in a long time (6+ months?),
* have few revisions (that is?), or
* have only a human editor or two?

(and that is used to encourage participants to make
sure their articles are good enough).

Dubious reasoning; other factors kept unchanged, articles with errors or 
other deficiencies are more likely to be edited further.

I would like to know if there are
statistics that support this for the English and Arabic Wikipedia.

Wikistats reports on the average number of edits per article (while 
you'd need a median at least): 

MediaWiki tells you the 5000 oldest pages 
https://ar.wikipedia.org/wiki/Special:AncientPages and you can easily 
replicate such a query e.g. on http://quarry.wmflabs.org/
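
On the average versus median point above, a toy example may help. The numbers are invented purely to show the effect of a few heavily edited articles; they are not Wikistats data.

```python
from statistics import mean, median

# Hypothetical edit counts for ten articles: most have a handful of
# revisions, one is edited constantly.
edits_per_article = [1, 1, 2, 2, 3, 3, 4, 5, 40, 939]

average = mean(edits_per_article)
typical = median(edits_per_article)
print(f"mean: {average}, median: {typical}")
```

Here the mean is 100 edits per article while the median is 3, so the average alone says very little about how often a typical article gets edited.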



Re: [Analytics] Recent cross-language view stats

2016-09-05 Thread Federico Leva (Nemo)

Leon Ziemba, 05/09/2016 22:20:

I'm not sure if this restriction actually helps with performance, maybe
others could shed light on this?

It's the usual problem with https://phabricator.wikimedia.org/T125345



Re: [Analytics] Analysing link

2016-08-26 Thread Federico Leva (Nemo)

Jan Dittrich, 26/08/2016 10:03:

or even click paths

Do you know about 



Re: [Analytics] Ranking Wikimedia projects by sizes or activity levels

2016-07-11 Thread Federico Leva (Nemo)

http://wikistats.wmflabs.org/ has this on the main page.

https://www.wikimedia.org/ is the easy way to remember the rank by 
"size", which as always is determined by how used they are.



Re: [Analytics] Survey for Wikipedia readers

2016-05-30 Thread Federico Leva (Nemo)

Vipul Naik, 31/05/2016 02:51:

Any feedback on the survey questions would also be appreciated, on- or

You should specify what the data will be used for.



Re: [Analytics] Retrieving filenames for category

2016-05-20 Thread Federico Leva (Nemo)

Sander Ubink, 20/05/2016 14:30:

We cannot find out how BaGLAMa collects the filenames for all files
within a category.




Re: [Analytics] Video view stats

2016-05-17 Thread Federico Leva (Nemo)

Itzik - Wikimedia Israel, 17/05/2016 11:29:

do we have a tool to pull these
numbers (for people without SQL access)...

Yes, the same as earlier: 

Unless you mean "for people without command line access on their 
computer", in which case I don't remember.



Re: [Analytics] [WikimediaMobile] "Among mobile sites, Wikipedia reigns in terms of popularity"

2016-05-11 Thread Federico Leva (Nemo)
Thanks; Nielsen data can indeed be very useful, I asked about it earlier 
because I'd love to have it again for Italy.



Tilman Bayer, 11/05/2016 19:23:

New study (US only) by the Knight Foundation:
https://medium.com/mobile-first-news-how-people-use-smartphones-to ,
summarized here:

"People spent more time on Wikipedia’s mobile site than any other news
or information site in Knight’s analysis, about 13 minutes per month
for the average visitor. CNN wasn’t too far behind, at 9 minutes 45
seconds per month. BuzzFeed clocked in third at 9 minutes 21 seconds
per month. (BuzzFeed, however, slays both CNN and Wikipedia in time
spent with the sites’ apps, compared with mobile websites. BuzzFeed
users devote more than 2 hours per month to its apps, compared with
about 46 minutes among CNN app users and 31 minutes among Wikipedia
app loyalists.)

Another way to look at Wikipedia’s influence: Wikipedia reaches almost
one-third of the total mobile population each month, according to
Knight’s analysis, which used data from the audience-tracking firm


Re: [Analytics] Reports on Views and Edits per Country

2016-04-18 Thread Federico Leva (Nemo)

Andre Klapper, 18/04/2016 13:29:

says it's "Discontinued since June 2015" but does not tell me where to
find recent view/edit reports per country.

Does anybody know whether recent data exists and where it's available?

The main page on the matter is/was 
; some reports are marked as migrated.



Re: [Analytics] Data Request: How many % of new WP Users do submit an Email?

2016-04-13 Thread Federico Leva (Nemo)
When you say "users in the german speaking area", would you content 
yourself with newly created accounts on the German Wikipedia (and sister 
projects in German)?


Nuria Ruiz, 13/04/2016 17:44:

(cc-ing analytics@, our public list in case other contributors can chime
in for ideas)

>For an evaluation of one of the options I need to know how many
percent of the newly registered users in the German-speaking area submit
an email address - on average. Do you have any data on that that you
could share with us? This would be very helpful!

We do not have data on this regard, our datasets are geared towards
pageviews and edits for the most part and this is neither. Now, if you
have some development resources available this is data that a developer
can get from the mediawiki database without too much effort.



On Wed, Apr 13, 2016 at 7:27 AM, Erik Zachte <ezac...@wikimedia.org> wrote:

Hi Katharina,

Sorry I have no data for this.

Maybe our Analytics Team can help?
Referring you to Nuria Ruiz, who leads the team.


-Original Message-
From: Katharina Nocun [mailto:katharina.no...@wikimedia.de]
Sent: Wednesday, April 13, 2016 13:51
To: erikzac...@infodisiac.com 
Subject: Data Request: How many % of new WP Users do submit an Email?

Dear Erik,

let me briefly introduce myself: I work for Wikimedia Germany as
campaigns manager. Currently we are evaluating different options for
a campaign that shall attract new contributors for Wikipedia in the
German-speaking area. For an evaluation of one of the options I need
to know how many percent of the newly registered users in the
German-speaking area submit an email address - on average. Do you have any
data on that that you could share with us? This would be very helpful!


Re: [Analytics] [Data Release] [Data Deprecation] [Analytics Dumps]

2016-03-23 Thread Federico Leva (Nemo)

Dan Andreescu, 23/03/2016 15:58:

*Clean-up:* Analytics data on dumps was crammed into /other with
unrelated datasets.  We made a new page to receive current and future
datasets [3] and linked to it from /other and /.  Please let us know if
anything there looks confusing or opaque and I'll be happy to clarify.

I assume the old URLs will redirect to the new ones, right?



Re: [Analytics] Dark traffic

2016-03-01 Thread Federico Leva (Nemo)

James Forrester, 01/03/2016 15:59:

to be more of a "good citizen" of the Internet

...people should make their websites HTTPS.



Re: [Analytics] Pagecounts dumps page title UTF-8 escaping

2016-02-03 Thread Federico Leva (Nemo)

Bo Han, 04/02/2016 00:40:

Is the logic for the escaping available somewhere?

MediaWiki API does https://phabricator.wikimedia.org/T29849
For the new pageviews API I got this reply on Unicode normalisation: 

(Phabricator is down right now; wait a couple hours or check 



Re: [Analytics] Pageview stats tools

2016-01-31 Thread Federico Leva (Nemo)

Pine W, 31/01/2016 09:07:

Apologies if this information was already published and I missed it.




Re: [Analytics] WikimediaBot convention

2016-01-28 Thread Federico Leva (Nemo)

Marcel Ruiz Forns, 28/01/2016 01:15:

we (Analytics team) never finished establishing and
advertising it. In this email we explain what the convention is today
and what purpose it serves.

So this email is not meant to advertise the convention, right? Because 
the audience of this mailing list certainly doesn't include crawler 

(*) There is already another convention[2] for bots that EDIT Wikimedia

[2] https://www.mediawiki.org/wiki/Manual:Bots

I suppose this page was linked because from there one can click 
[[API:Client code]] > 
https://www.mediawiki.org/wiki/API:Etiquette#User-Agent_header > 
https://meta.wikimedia.org/wiki/User-Agent_policy which is the mentioned 



Re: [Analytics] Multimedia data being crunched, expanded - first look

2016-01-25 Thread Federico Leva (Nemo)

Mark Holmquist, 25/01/2016 15:58:

You can find the graphs here:


At the default size, I only see 2010 in the x axis. Is there a way to 
reduce scale without zooming out? Only at 33 % zoom I manage to see 
2015, and that's not very readable.


Analytics mailing list

Re: [Analytics] Video view stats

2016-01-17 Thread Federico Leva (Nemo)

Andrew Gray, 17/01/2016 22:36:

Am I right in saying that therefore this means:

Browser requests are not easy. See 

For the purposes of your blog post, it's probably best to only consider 
columns 4 and 17 (see 
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts ).
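
A rough Python sketch of pulling out just those two columns, assuming the mediacounts files are tab-separated with one file per line and that the column numbers above are 1-based. The function name pick_columns is made up for this example; check the wikitech page for what each column means.

```python
def pick_columns(line: str, columns=(4, 17)) -> list[str]:
    """Return the requested 1-based columns of one tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    # Column numbers in the documentation start at 1, Python lists at 0.
    return [fields[c - 1] for c in columns]
```

Feeding it every line of a daily file and keeping only the rows for the file names you care about should be enough for a blog post.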


Analytics mailing list

Re: [Analytics] timestamp wiki pagecounts

2015-12-24 Thread Federico Leva (Nemo)

Maurice Vergeer, 24/12/2015 10:16:

I am looking at your pagecounts as archived on
Can you tell me from what timezone the time stamps originate?

Any and all timestamps in dumps.wikimedia.org are in UTC. Apparently 
this is not as obvious as generally thought, so I've added a note: 
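
In practice this means the timestamp in a file name can be read directly as UTC and converted afterwards. A minimal Python sketch, assuming file names of the form pagecounts-YYYYMMDD-HHMMSS.gz (the helper name filename_to_utc is made up for this example):

```python
import re
from datetime import datetime, timedelta, timezone


def filename_to_utc(name: str) -> datetime:
    """Read the YYYYMMDD-HHMMSS part of a file name as a UTC datetime."""
    match = re.search(r"(\d{8})-(\d{6})", name)
    if match is None:
        raise ValueError(f"no timestamp found in {name!r}")
    naive = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")
    # The dumps use UTC, so attach that timezone explicitly.
    return naive.replace(tzinfo=timezone.utc)


# Example: convert to a fixed UTC+1 offset for local reporting.
local = filename_to_utc("pagecounts-20151224-100000.gz").astimezone(
    timezone(timedelta(hours=1))
)
```

Keeping everything in UTC until the last step avoids surprises around daylight saving changes.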


Analytics mailing list

Re: [Analytics] Inconsistent user IDs between EventLogging and main database

2015-12-16 Thread Federico Leva (Nemo)

Neil P. Quinn, 16/12/2015 21:40:

Does anyone know what's going on? Is this issue documented anywhere?

There is already at least one phabricator report IIRC.


Analytics mailing list

Re: [Analytics] Goals of analytics team for next quarter

2015-12-16 Thread Federico Leva (Nemo)

Nuria Ruiz, 16/12/2015 18:42:

TL;DR: We are going to mostly work on replacing reports on

Do you mean traffic reports? See also 
https://phabricator.wikimedia.org/T107175#1498819 and edit the task 
summary there please.

Is https://phabricator.wikimedia.org/T118329 a duplicate?


Analytics mailing list

Re: [Analytics] How many times has a video been played?

2015-12-15 Thread Federico Leva (Nemo)

Dan Andreescu, 15/12/2015 03:43:

Or python if that's easier.

is very easy to use. Download from dumps.wikimedia.org is tragically 
slow, making any one-time analysis impractical, but 
/data/scratch/tmp/mediacounts on Labs has a copy of October data.


Analytics mailing list

Re: [Analytics] Data collection

2015-12-14 Thread Federico Leva (Nemo)

Erik Zachte, 14/12/2015 14:14:

I can run similar reports for earlier months.

Thanks for publishing that code too! 


Analytics mailing list

Re: [Analytics] Readership metrics for the fortnight until December 6, 2015

2015-12-14 Thread Federico Leva (Nemo)

Interesting country breakdown!

Tilman Bayer, 14/12/2015 12:32:

For the top three, I looked at how pageviews developed on a daily basis
during the last three months including the week after this large change
(until Dec 6):

In Greece, the +21.6% rise was the result of an isolated spike from
November 23-25. This can be traced to a single page on the Greek
Wiktionary which on most days before and after only saw a single-digit
number of pageviews, but on these three days received more than 2.8
million: τάλε κουάλε
It’s about an expression that apparently comes from Latin via Italian
(“tale quale”) and means
something like “exactly the same” or “spitting image”. From the form of
the spike, it was likely not the result of actual human interest, rather
an undetected bot trying to learn exactly the same about exactly the same.

In Ireland, the -20.6% drop marked the end of a plateau whose start had
actually shown up in the report for the week until November 1
where the country was the top changer with a 40.2% rise.

For South Africa, the -20.6% drop does not form part of a clear pattern.

Analytics mailing list

Re: [Analytics] Preliminary goals for analytics infrastructure team

2015-12-03 Thread Federico Leva (Nemo)

Jon Katz, 03/12/2015 06:16:

Pywik up and running for iOS (simple machine spin-up, if not finished in Q2)

What is meant here by "Pywik"? Piwik? Some shortening of pywikibot? Other?


Analytics mailing list

Re: [Analytics] Confusing pageviews

2015-12-02 Thread Federico Leva (Nemo)

Oliver Keyes, 02/12/2015 18:52:

Via Brian Davis we find out the responsible patch is



Analytics mailing list

Re: [Analytics] Backlinks TO Wikipedia

2015-12-01 Thread Federico Leva (Nemo)

Edison Nica, 29/11/2015 16:56:

how many non-wikipedia pages point to a certain wikipedia page

I guess the only way we have to know this (other than grepping request 
logs for referrers, which would be quite a nightmare) is to access the 
Google Webmaster account for wikipedia.org (to which a couple employees 
had access, IIRC).


Analytics mailing list

Re: [Analytics] Commons Alexa rank drop in May 2015

2015-11-18 Thread Federico Leva (Nemo)

Dan Andreescu, 18/11/2015 22:32:

But that was explained to me as "we started filtering spiders better".
So I don't think that would affect Alexa's numbers but maybe it's a bad
coincidence that around the same time something else happened that
dropped the numbers.  And the convolution made us all miss it.

Or maybe the Alexa stats contained the same filtering error and fixed it 
at the same time as us! Sounds unlikely.


Analytics mailing list

[Analytics] Commons Alexa rank drop in May 2015

2015-11-18 Thread Federico Leva (Nemo)
Anything meaningful in the drop from #200 to #240 position in global 
Alexa rank? http://www.alexa.com/siteinfo/commons.wikimedia.org
We know Alexa has many deficiencies, can HTTPS/HSTS have disrupted their 


Analytics mailing list

Re: [Analytics] Does StackExchange have more monthly active users than Wikipedia?

2015-11-13 Thread Federico Leva (Nemo)

Timo Tijhof, 14/11/2015 01:38:

StackOverflow's recent blog post about renaming their organisation does
make an interesting claim though.


 > The [Stack Exchange] network as a whole has more monthly 5-time
posters than English Wikipedia has 5-time monthly editors.

Yes, that's the explanation given in the question I linked: 
http://meta.stackexchange.com/a/269344/248268 See there for open 
questions on how this number was calculated.


Analytics mailing list

[Analytics] Does StackExchange have more monthly active users than Wikipedia?

2015-11-13 Thread Federico Leva (Nemo)
Some information at 

TL;DR: not really, and definitely not StackOverflow alone (~14k). But 
perhaps the whole StackExchange has more than the English Wikipedia alone.


Analytics mailing list

Re: [Analytics] Pagecounts strange record

2015-11-03 Thread Federico Leva (Nemo)

Giacomo Marangoni, 31/10/2015 17:55:

Sometimes I find records like this: “it.n Addio_al_regista_Sydney_Pollack 1 0”, 
and I can’t explain how a page could be visited once and return a 
response of 0 bytes.

Did you check whether the page existed at the time? pagecounts-raw also 
records 404 aka red links. (Visited e.g. by the user who then creates 
the page.)
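
To spot such records in bulk, a small Python sketch that splits a pagecounts line into its four space-separated fields (project code, page title, request count, bytes transferred). The names PagecountRecord and parse_pagecount_line are made up for this example.

```python
from typing import NamedTuple


class PagecountRecord(NamedTuple):
    project: str
    title: str
    requests: int
    bytes_sent: int


def parse_pagecount_line(line: str) -> PagecountRecord:
    """Split one pagecounts line into its four whitespace-separated fields."""
    project, title, requests, bytes_sent = line.split()
    return PagecountRecord(project, title, int(requests), int(bytes_sent))


record = parse_pagecount_line("it.n Addio_al_regista_Sydney_Pollack 1 0")
# A request that transferred no bytes is a candidate for a missing page.
is_zero_byte = record.requests > 0 and record.bytes_sent == 0
```

Whether a zero-byte record really was a red link still has to be checked against the page history for that date.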


Analytics mailing list

Re: [Analytics] Special:Log/move and Special:NewPages

2015-10-30 Thread Federico Leva (Nemo)


Analytics mailing list

Re: [Analytics] How is "article" defined in Special:Statistics?

2015-10-28 Thread Federico Leva (Nemo)



Analytics mailing list

Re: [Analytics] [Spam] Re: User statistics for video marking ENWP 5m article milestone

2015-10-27 Thread Federico Leva (Nemo)

Jonathan Morgan, 27/10/2015 18:53:

Either way, it's safe to say that the total number is in the millions.

+1. It's correct to say that millions have edited Wikipedia, and 
probably editors for Wikimedia projects are in the order of 10^7. There 
is no information gain in trying to give more precise numbers.

The point here is very simple, Wikimedia wikis are the most massively 
multi-author work ever created. Sure, the number may be misleading if 
associated with different claims.


Analytics mailing list

Re: [Analytics] Canonical location for metrics documentation

2015-10-13 Thread Federico Leva (Nemo)

Neil P. Quinn, 14/10/2015 02:30:

We currently have metrics documentation in two different places

What sort of documentation do you have in mind? Meta has the definitions 
which WMF hopes to see used in other fields as well, while MediaWiki.org 
and wikitech have technical documentation about stats.wikimedia.org and 
other stuff produced by Analytics.


Analytics mailing list

Re: [Analytics] Using wikimedia pagehits and other data for creating statistics on exposure to culture

2015-10-11 Thread Federico Leva (Nemo)

Dear Albrecht,
as an avid Eurostat consumer let me congratulate you for a very 
interesting project, which will be of great interest not only to the 
general population but also to direct consumers of Eurostat such as

* Europeana,
* the European Commission (for policy implications),
* EU governments (some of which supported Wiki Loves Monuments, e.g. for 

* the Council of Europe (for their support for Wiki Loves Monuments).

Dan already pointed out what will be the best route for you to pursue 
in the future, but let me point out what data and tools people have 
needed and made available so far:
* for recent raw data, 
https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites ;
* global usage of World Heritage Sites images: 190k usages for 24k 
images (via 
* pageviews in a wiki (e.g. English Wikipedia) for all articles 
classified as world heritage sites on Wikidata: says ~3 M/month for ~800 
articles (via WDQ claim[1435:9259] in Listeria > PagePile > TreeViews 

* for Europeana, https://meta.wikimedia.org/wiki/Europeana/Stats ;
* for WLM images accesses in all languages, e.g. 
http://tools.wmflabs.org/glamtools/baglama2/#gid=60&month=201508 ;
* for multilingual pageviews of an individual topic, e.g. 


albrecht.wirthm...@ec.europa.eu, 06/10/2015 11:35:

Dear Wikimedia Analytics team,
My name is Albrecht Wirthmann. I am working in a task team Big Data at
Eurostat. This is the European statistical office, which is part of
the European Commission. The task team is exploring new data sources
for their feasibility of producing official statistics. We have been
looking at various internet data sources including Wikimedia. The idea
that we are currently following up is identifying Wikipedia pages in
English that are referring to World Heritage sites and to analyse the
number and development of page views of those pages as an indicator of
exposure to culture. For this purpose we downloaded the page views files
from _http://dumps.wikimedia.org/other/pagecounts-ez/_. The data should
be later on included in a pocket book showing statistics on culture in
the European Union.
I am contacting you to make you aware of our intentions, to ask if there
would be any concerns related to our project and to possibly have a chat
with you and your team to ask some technical questions and about the
possibility of getting some additional data. We would be interested in
page hits by country in order to be more specific on the statistics that
we would compile.
We would be very glad about a positive reply and remain at your disposal,
Kind regards,
Albrecht Wirthmann
TF Big Data
BECH building
5, rue Alphonse Weicker
L 2721 Luxembourg

Analytics mailing list

Re: [Analytics] Editor retention in Wikidata

2015-10-04 Thread Federico Leva (Nemo)

Emw, 04/10/2015 20:24:

as well as the ticket tracking work on a Wikidata analytics dashboard
[3].  However, after searching there and other usual places [4, 5, 6], I
have not found a chart suitable for an "apples-to-apples" comparison
between Wikidata and Wikipedia along the lines of Howie's famous graph.

You should start including Meta-Wiki in your "usual places". ;-)
* https://meta.wikimedia.org/wiki/Research:Editor_retention
** https://meta.wikimedia.org/wiki/Research:Surviving_new_editor (note: 
survival ~2 months later, not 1 year)

** https://meta.wikimedia.org/wiki/Category:Editor_engagement


Which is a slightly different way to look at the matter but tells the 
same story. Is that good enough, or do you necessarily want to replicate 
the same old graph?

Note that I'm not sure apples-to-apples comparisons between Wikidata and 
Wikipedia are possible at all. For instance, does a sitelink edit 
performed from the sitelink dialog on a client wiki count as activation?


Analytics mailing list

Re: [Analytics] corrupted and missing log files

2015-09-14 Thread Federico Leva (Nemo)
Users also keep a list in the stats.grok.se FAQ: 


Analytics mailing list

Re: [Analytics] Users changing language version through interwiki links

2015-09-12 Thread Federico Leva (Nemo)

Strainu, 12/09/2015 14:43:

Would it be possible to track the number of users changing language version
in each article? Like: on date X, Y users visited a.wikipedia.org and Z left
to go to b.wikipedia.org, T left for c.wikipedia.org etc.

Indeed! We've been promised this data release already at the time of the 
2010 ClickTracking. :)

If possible, is there interest (aka who do I have to bribe) to implement that
as a publicly-available dump/site?

can be made cross-wiki at some point.


Analytics mailing list

Re: [Analytics] Breakdown of unique visitors by country (and by project)

2015-09-08 Thread Federico Leva (Nemo)

Cristian Consonni, 08/09/2015 19:42:

we (Wikimedia Italia) are starting writing a proposal for a EU project
(in the Horizon 2020 framework) and our partners asked us for "numbers
to quantify the readership of Wikipedia in the various languages
interested by the project".

Traditionally, for this purpose Wikimedia Italia used the Nielsen 
NetRatings, which AFAIK are/were available for several European 
countries. I can't easily find them in their new websites though.


Analytics mailing list

[Analytics] Anyone using the user_daily_contribs table/API?

2015-09-05 Thread Federico Leva (Nemo)

See https://phabricator.wikimedia.org/T85984

The user_daily_contribs table (and associated API) is sometimes used for
* JavaScript (e.g. CentralNotice) targeting users based on activity in a 
certain timeframe,

* simplification of SQL queries (e.g. [1]),
* other?

If you use this data/feature or plan to use it, or if you replaced it 
with something else, your comment on the task is particularly welcome to 
assess whether to keep it.



Analytics mailing list

Re: [Analytics] Vital Signs dashboard

2015-08-24 Thread Federico Leva (Nemo)

Neil P. Quinn, 25/08/2015 01:58:

Sorry—it turns out that this is a browser bug! All of the graphs except
legacy pageviews display no lines at all in Firefox (I've tested 42.0a2
and 41.0b3). I really should have checked that first.

I'll file it in Phab; hopefully, you can take a look at some point.

It's already filed, please add/edit: 


Analytics mailing list

Re: [Analytics] pageviews_hourly table

2015-08-23 Thread Federico Leva (Nemo)

Tilman Bayer, 22/08/2015 19:33:

And I know that other issues were caught by ErikZ's proactive vigilance,
which will need to find an equivalent in the upcoming replacement for



Analytics mailing list

Re: [Analytics] [Wikimedia-search] Scaleable Event Systems recap

2015-08-04 Thread Federico Leva (Nemo)

Oliver Keyes, 04/08/2015 00:12:

a lot less cautious about our sampling

A bit, perhaps, not a lot. Sampling is not just a performance matter.


Analytics mailing list

Re: [Analytics] proposal to axe current traffic reports

2015-07-24 Thread Federico Leva (Nemo)

Erik Zachte, 24/07/2015 18:59:

I think the time has come to disable the traffic reports based on
webstatscollector (2.0) data.


Only the breakdowns by client? All the breakdowns? All the pageview stats?
The country data is very important, for instance: people often ask for such 
numbers (at least in Italy); nobody is ever looking at all of them in 
detail, so it's important for i18n etc. that they are available for 
everyone to look at their corner.


Analytics mailing list

Re: [Analytics] Request for three viewership statistics

2015-07-07 Thread Federico Leva (Nemo)

Pine W, 07/07/2015 02:29:

(2) During the past 90 days or so, how many unique users have viewed
on the various Wikimedia pages where it's included?

(2) During the past 90 days or so, how many times has
been viewed on the various Wikimedia pages where it's included?



Analytics mailing list

Re: [Analytics] rsync from stat1002 broken

2015-06-22 Thread Federico Leva (Nemo)

Oliver Keyes, 22/06/2015 22:51:


(Presumed context: http://searchdata.wmflabs.org/ . Thanks 


Analytics mailing list

Re: [Analytics] [Wikimedia-l] Wikipedia article per speaker

2015-06-13 Thread Federico Leva (Nemo)

Asaf Bartov, 13/06/2015 02:42:

The (already existing) metric of active-editors-per-million-speakers is,
it seems to me, a far more robust metric.  Erik Z.'s stats.wikimedia.org
 is offering that metric.

I personally agree on this in general, but Millosh is trying something 
different in his current quest, i.e. content ingestion and content 
coverage assessment, also for missing language subdomains. (By the way, 
I created the category, please add stuff: 
https://meta.wikimedia.org/wiki/Category:Content_coverage .)

Mere article count tells us very little and he acknowledged it. As you 
added analytics: maybe when https://phabricator.wikimedia.org/T44259 is 
fixed we can also do fancy things like join various tables and count 
(countable) articles above a minimum threshold of hits, or something 
like that.
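
Once both data sets are available, the counting itself is trivial. A Python sketch, assuming you already have a mapping from title to hits and a set of countable article titles (both hypothetical inputs, as is the function name):

```python
def count_articles_above(
    pageviews: dict[str, int], countable: set[str], min_hits: int
) -> int:
    """Count countable articles whose hits reach the minimum threshold."""
    # Articles missing from the pageview data are treated as having 0 hits.
    return sum(1 for title in countable if pageviews.get(title, 0) >= min_hits)
```

The hard part is getting a clean join between the two sources, not the threshold.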

Oh, and the total number of internal links in a wiki is also an 
interesting metric in many cases: they're often a good indicator of how 
curated a wiki globally is, while bot-created articles are often orphan. 
(Locally there might be overlinking but that's rarely a wiki-wide 
issue.) I don't remember how reliable the WikiStats numbers are, but 
they often give a good clue already.


Analytics mailing list

Re: [Analytics] deletion of newly created articles

2015-05-31 Thread Federico Leva (Nemo)
You can presumably replicate 
https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation for 
other languages with the code provided, if you're interested in more 
than the 8 or so biggest Wikipedias.

Or you can ask the speedy deletion Wikias: 


Analytics mailing list

Re: [Analytics] Analytics needs -- Piwik for ru.wikimedia.org

2015-05-31 Thread Federico Leva (Nemo)

Oliver Keyes, 27/05/2015 18:04:

Well, the pageviews data will very deliberately /not/ contain any data
from the chapters' wikis.

Will? When? Why? Makes no sense. Filed a bug: 


Analytics mailing list

Re: [Analytics] article creation stuck in February

2015-04-30 Thread Federico Leva (Nemo)

Amir E. Aharoni, 30/04/2015 14:37:

The article creation tables were last updated for February:

Apparently it's stuck on the lack of a fa.wiki dump:
* https://stats.wikimedia.org/WikiCountsJobProgress.html
* http://www.infodisiac.com/cgi-bin/WikimediaDownload.pl


Analytics mailing list

Re: [Analytics] Rough estimate of percentage of requests without Javascript enabled/capable clients

2015-03-25 Thread Federico Leva (Nemo)

Timo Tijhof, 25/03/2015 03:11:

I honestly have no clue how popular our www-portals are. I'd be
interested in seeing some stats on that.

WikiStats has long had such statistics. Portals are about 3 % of total 
page views, but much more in some countries.


Analytics mailing list

Re: [Analytics] [Announce] New daily feed: media file request counts

2015-03-25 Thread Federico Leva (Nemo)

Hay (Husky), 25/03/2015 11:03:

Answering my own question: until somebody puts up a stats.grok.se-like
interface for the mediacounts, i've hacked together a Python script
that can be used to 'query' the TSV files with a file, or a list of


And I sent a small silly patch to give a category name like 
https://commons.wikimedia.org/wiki/Category:Media_from_BEIC as input. 
Example output attached for the lazy.

Some data I found particularly interesting:
1) the sum of columns 11–14 (big thumbs),
2) the ratio between (1) and column 3 (total transfers),
3) column 24 (no Wikimedia referrer).
	Total transfers in this small sample seem even higher than pageviews. 
(1) counts thumbs above 400 pixels, which are usually not embedded by 
default: (2) should tell how many users probably clicked or did 
something else. (3) may indicate which files "went viral".
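
For reference, the three figures can be computed per file with something like the Python sketch below. It assumes tab-separated lines, 1-based column numbers as used above, and integer values in those columns; the function name mediacount_indicators is made up for this example.

```python
def mediacount_indicators(line: str) -> dict:
    """Compute the three indicators above from one tab-separated line."""
    fields = line.rstrip("\n").split("\t")

    def col(n: int) -> int:
        # Column numbers are 1-based, list indices 0-based.
        return int(fields[n - 1])

    big_thumbs = sum(col(n) for n in range(11, 15))  # columns 11 to 14
    total = col(3)
    return {
        "big_thumbs": big_thumbs,
        # Ratio is undefined when there were no transfers at all.
        "big_thumb_ratio": big_thumbs / total if total else None,
        "no_wikimedia_referrer": col(24),
    }
```

Sorting files by the ratio or by the last figure is a quick way to find the outliers.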


Description: application/vnd.oasis.opendocument.spreadsheet
Analytics mailing list

Re: [Analytics] US Gov released request log datasets today

2015-03-19 Thread Federico Leva (Nemo)

Jeremy Baron, 19/03/2015 20:59:

I wonder what their pageview definition is. :)

Whatever Google Analytics uses, it seems? :/ 


Analytics mailing list

Re: [Analytics] Relevant Content Availability

2015-03-17 Thread Federico Leva (Nemo)

Abdel Samad, Rawia, 21/01/2015 09:47:

I work for a consulting firm called Strategy&. We have been engaged by
Facebook on behalf of Internet.org to conduct a study on assessing the
state of connectivity globally. One key area of focus is the
availability of relevant online content. We are using the availability
of encyclopedic knowledge in one’s primary language as a proxy for
relevant content. We define this as 100K+ Wikipedia articles in one’s
primary language.

Hello Rawia,
is there any update on this project? Have you contacted Google about 
similar "content availability" and "content ingestion" activities they 
conducted in the past, also related to machine translation 
(https://meta.wikimedia.org/wiki/Machine_translation )?

We are very interested in this sort of initiative (see also 
), but experience taught us that looking at the wrong things can have 
terrible consequences.


We have a few questions related to this analysis prior
to publishing it:

· We are currently using the article count by language based on the
Wikimedia Foundation’s public link. Source:
http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable
source for article count – does it include stubs?

· Is it possible to get historic data for article count? It would be
great to monitor the evolution of the metric we have defined over time.

· What are the biggest drivers you’ve seen for step changes in the number
of articles (e.g., number of active admins, machine translation, etc.)?

· We had to map Wikipedia language codes to ISO 639-3 language codes in
Ethnologue (the source we are using for primary language data). The
2-letter language code for a Wikipedia language in the “List of Wikipedias”
sometimes, but not always, matches the ISO 639-1 code. Is there an easy
way to do the mapping?

Many Thanks,


Analytics mailing list

Re: [Analytics] [Technical] final pageviews QA

2015-03-13 Thread Federico Leva (Nemo)
On February 11, 
https://www.mediawiki.org/wiki/MediaWiki_1.25/wmf16#CentralNotice was 
deployed. The deprecation/decrease of Special:BannerRandom and 
Special:RecordImpression can easily justify a decrease of hundreds of 
millions of requests. Probably wikistats was already filtering them.

Erik Zachte, 13/03/2015 00:41:

That real legacy definition, with all of its known deficiencies, is what will 
matter for our veteran users and any discrepancy from there needs explaining.

I agree that what's needed here is a graph comparing current wikistats 
(and reportcard?) to future output with the new figures. The new 
definition is not final until it's "live". :)

I have trouble understanding the current and expected distribution 
pipeline of pageviews data and figures. It would be nice to have an 
overview somewhere, even just a list of links to the different steps. 


Analytics mailing list

Re: [Analytics] [Cluster] Monitoring the impact Hive jobs have on the Analytics cluster

2015-03-07 Thread Federico Leva (Nemo)

Christian Aistleitner, 07/03/2015 15:14:

P.S.: The above URL has diagrams! Click the URL!

And with colours! So it's like checking heartbeats, cute. :)


Analytics mailing list

Re: [Analytics] Odd data in dumps

2015-03-01 Thread Federico Leva (Nemo)
Getting aggregation right is hard. :) The only wiki I know whose counts 
are surely wrong is wmfwiki: https://phabricator.wikimedia.org/T51266


Analytics mailing list

Re: [Analytics] [Wiki-research-l] [Release]

2015-02-25 Thread Federico Leva (Nemo)

Erik Zachte, 25/02/2015 23:34:

Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/  and

Ironholds' looks more vulnerable to bots, it's easier to see in small 
wikis (though, kudos! many more small wikis are included than in 
wikistats). For instance, 20 more percentage points for USA on Breton 
and Bavarian Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on 
Kurdish. For Chinese bots they look similar, though in some cases I'm 
not sure what's going on: for instance als.wiki also sees CH and RO emerge.

Will the new pageviews definition use the same bot filtering method?


Analytics mailing list

[Analytics] Fwd: Reasons you use the XML dumps or want to, but can't?

2015-02-20 Thread Federico Leva (Nemo)


 Forwarded message 
Subject:[Xmldatadumps-l] Your comments needed (long term dumps rewrite?)
Date:   Thu, 19 Feb 2015 12:30:01 +0200
From:   Ariel Glenn WMF 
To:  xmldatadump...@lists.wikimedia.org

The MediaWiki Core team has opened a discussion about getting more
involved in and maybe redoing the dumps infrastructure.  A good starting
point is to understand how folks use the dumps already or want to use
them but can't, and some questions about that are listed here:

I've added some notes but please go weigh in.  Don't be shy about what
you do/what you need, this is the time to get it all on the table.


Xmldatadumps-l mailing list
Analytics mailing list

Re: [Analytics] stats.grok.se not updating

2015-02-12 Thread Federico Leva (Nemo)

Dan Andreescu, 12/02/2015 18:55:

Sounds like we need to file a ticket, cc-ing Andrew and Ariel directly.
Guys, Henrik's having bad network speed getting data from dumps.

* https://phabricator.wikimedia.org/T45647
* https://gerrit.wikimedia.org/r/#/c/189447/


Analytics mailing list
