Milimetric has submitted this change and it was merged.

Change subject: Fix computation of top-ten countries
......................................................................


Fix computation of top-ten countries

When computing the top-ten countries for a project, countries have
been ordered only by the counts for single days. The first ten
distinct countries were considered the top ten countries. Hence,
countries whose IPs had many edits on a single day once, but not so
many in total were favored over countries whose IPs had fewer edits
over a longer period with a higher total. In connection with using
DISTINCT and LIMIT, this gave inconsistent results.

So for example for sgwiki, gave 'Netherlands' (maximum count 4) before
'United States' (maximum count 5).

We now switched to proper GROUP BY and ordering by the sum of counts.

The query did not force a consistent way of cutting ties. We now cut
ties, by the end date (descending), and finally on the country name.

The query did not limit to rows for the 30 day window. So rows for
different periods could lead to wrongly favor a country. We now limit
to rows of the 30 day window.

Change-Id: I1731c8104b03b5ee92367b41b9e2887da2041f8e
---
M scripts/make_limn_files.py
1 file changed, 10 insertions(+), 6 deletions(-)

Approvals:
  Milimetric: Verified; Looks good to me, approved



diff --git a/scripts/make_limn_files.py b/scripts/make_limn_files.py
index cefa3bb..6e422bc 100755
--- a/scripts/make_limn_files.py
+++ b/scripts/make_limn_files.py
@@ -112,15 +112,19 @@
     limn_name = '%s Editors by Country (top %d)' % (proj.upper(), k)
 
     if sql.paramstyle == 'qmark':
-        top_k_query = """SELECT DISTINCT(country)
+        top_k_query = """SELECT country
                     FROM erosen_geocode_active_editors_country
-                    WHERE project=? AND cohort='all'
-                    ORDER BY count DESC LIMIT ?"""
+                    WHERE project=? AND cohort='all' AND end = start+INTERVAL 
30 day
+                    GROUP BY country
+                    ORDER BY SUM(count) DESC, end DESC, country
+                    LIMIT ?"""
     elif sql.paramstyle == 'format':
-        top_k_query = """SELECT DISTINCT(country)
+        top_k_query = """SELECT country
                     FROM erosen_geocode_active_editors_country
-                    WHERE project=%s AND cohort='all'
-                    ORDER BY count DESC LIMIT %s"""
+                    WHERE project=%s AND cohort='all' AND end = start+INTERVAL 
30 day
+                    GROUP BY country
+                    ORDER BY SUM(count) DESC, end DESC, country
+                    LIMIT %s"""
         logger.debug('top k query: %s', top_k_query % (proj, k))
     cursor.execute(top_k_query, (proj, k)) # mysqldb first converts all args 
to str
     top_k = map(itemgetter('country'), cursor.fetchall())

-- 
To view, visit https://gerrit.wikimedia.org/r/85626
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I1731c8104b03b5ee92367b41b9e2887da2041f8e
Gerrit-PatchSet: 1
Gerrit-Project: analytics/geowiki
Gerrit-Branch: master
Gerrit-Owner: QChris <christ...@quelltextlich.at>
Gerrit-Reviewer: Milimetric <dandree...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

Reply via email to