Milimetric has submitted this change and it was merged. Change subject: Fix computation of top-ten countries ......................................................................
Fix computation of top-ten countries When computing the top-ten countries for a project, countries have been ordered only by the counts for single days. The first ten distinct countries were considered the top ten countries. Hence, countries whose IPs had many edits on a single day once, but not so many in total were favored over countries whose IPs had fewer edits over a longer period with a higher total. In connection with using DISTINCT and LIMIT, this gave inconsistent results. So for example for sgwiki, gave 'Netherlands' (maximum count 4) before 'United States' (maximum count 5). We now switched to proper GROUP BY and ordering by the sum of counts. The query did not force a consistent way of cutting ties. We now cut ties, by the end date (descending), and finally on the country name. The query did not limit to rows for the 30 day window. So rows for different periods could lead to wrongly favor a country. We now limit to rows of the 30 day window. Change-Id: I1731c8104b03b5ee92367b41b9e2887da2041f8e --- M scripts/make_limn_files.py 1 file changed, 10 insertions(+), 6 deletions(-) Approvals: Milimetric: Verified; Looks good to me, approved diff --git a/scripts/make_limn_files.py b/scripts/make_limn_files.py index cefa3bb..6e422bc 100755 --- a/scripts/make_limn_files.py +++ b/scripts/make_limn_files.py @@ -112,15 +112,19 @@ limn_name = '%s Editors by Country (top %d)' % (proj.upper(), k) if sql.paramstyle == 'qmark': - top_k_query = """SELECT DISTINCT(country) + top_k_query = """SELECT country FROM erosen_geocode_active_editors_country - WHERE project=? AND cohort='all' - ORDER BY count DESC LIMIT ?""" + WHERE project=? AND cohort='all' AND end = start+INTERVAL 30 day + GROUP BY country + ORDER BY SUM(count) DESC, end DESC, country + LIMIT ?""" elif sql.paramstyle == 'format': - top_k_query = """SELECT DISTINCT(country) + top_k_query = """SELECT country FROM erosen_geocode_active_editors_country - WHERE project=%s AND cohort='all' - ORDER BY count DESC LIMIT %s""" + WHERE project=%s AND cohort='all' AND end = start+INTERVAL 30 day + GROUP BY country + ORDER BY SUM(count) DESC, end DESC, country + LIMIT %s""" logger.debug('top k query: %s', top_k_query % (proj, k)) cursor.execute(top_k_query, (proj, k)) # mysqldb first converts all args to str top_k = map(itemgetter('country'), cursor.fetchall()) -- To view, visit https://gerrit.wikimedia.org/r/85626 To unsubscribe, visit https://gerrit.wikimedia.org/r/settings Gerrit-MessageType: merged Gerrit-Change-Id: I1731c8104b03b5ee92367b41b9e2887da2041f8e Gerrit-PatchSet: 1 Gerrit-Project: analytics/geowiki Gerrit-Branch: master Gerrit-Owner: QChris <christ...@quelltextlich.at> Gerrit-Reviewer: Milimetric <dandree...@wikimedia.org> _______________________________________________ MediaWiki-commits mailing list MediaWiki-commits@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits