Neil_P._Quinn_WMF added a comment.

Here's the query I used, which I would like someone in #product-analytics (e.g. @chelsyx and @Neil_P._Quinn_WMF) to review:

Sure thing!

I noticed once big thing: it seems like your counts of file page edits (n_edits_total, n_additions_total, etc.) include the initial edit that creates the pages, so in the end you're getting the proportion of files which have metadata added in the first 2 months, including during the initial upload.

I tried excluding those initial creations (event_timestamp != page_creation_timestamp), and it looks like the proportion goes from 99% to 50%.

Query excluding intial creations

WITH summarized_revisions AS (
  SELECT
    page_id, TO_DATE(page_creation_timestamp) AS creation_date,
    COUNT(1) AS n_edits, -- not including reverts or reverted
    SUM(IF(event_timestamp != page_creation_timestamp, 1, 0)) as n_later_edits,
    SUM(IF(revision_text_bytes_diff > 0 AND DATEDIFF(event_timestamp, page_creation_timestamp) <= 60 AND event_timestamp != page_creation_timestamp, 1, 0)) AS n_additions_2mo
  FROM wmf.mediawiki_history
  WHERE snapshot = '2018-12'
    AND wiki_db = 'commonswiki'
    AND page_creation_timestamp between "2018-10-01" and "2018-10-08"
    AND event_entity = 'revision'
    AND page_namespace = 6
    AND NOT revision_is_identity_revert -- don't count edits that are reverts
    AND NOT revision_is_identity_reverted -- don't count edits that were reverted
    AND NOT revision_is_deleted -- don't counts edits moved to archive table
    AND page_id IS NOT NULL -- don't count deleted files
  GROUP BY page_id, TO_DATE(page_creation_timestamp)
)
SELECT
  creation_date,
  COUNT(1) AS n_uploaded, -- files uploaded
  SUM(IF(n_later_edits > 0, 1, 0)) AS n_later_edited, -- files whose pages were edited after upload
  SUM(IF(n_additions_2mo > 0, 1, 0)) AS n_added_to_2mo -- files that have had metadata added after creation and in first 2 months
  FROM summarized_revisions
GROUP BY creation_date;

creation_daten_uploadedn_later_editedn_added_to_2mo
2018-10-01233901330710248
2018-10-0218226113088947
2018-10-03227631680312142
2018-10-0417455128969088
2018-10-05173211139710261
2018-10-06201911245610558
2018-10-0721479115759853

Other comments

WITH summarized_revisions AS (
  SELECT
    page_id, TO_DATE(page_creation_timestamp) AS creation_date,
    COUNT(1) AS n_edits_total, -- not including reverts or reverted

I think this includes uploads of new file versions, not just metadata edits, but I don't think it would change the results much.

  SUM(IF(revision_text_bytes_diff > 0, 1, 0)) AS n_additions_total,
  SUM(IF(DATEDIFF(event_timestamp, page_creation_timestamp) <= 60, 1, 0)) AS n_edits_2mo,
  SUM(IF(revision_text_bytes_diff > 0 AND DATEDIFF(event_timestamp, page_creation_timestamp) <= 60, 1, 0)) AS n_additions_2mo
FROM wmf.mediawiki_history
WHERE snapshot = '2018-12'
  AND wiki_db = 'commonswiki'
  AND event_entity = 'revision'
  AND page_namespace = 6
  AND NOT revision_is_identity_revert -- don't count edits that are reverts
  AND NOT revision_is_identity_reverted -- don't count edits that were reverted
  AND NOT revision_is_deleted -- don't counts edits moved to archive table
  AND page_id IS NOT NULL -- don't count deleted files

I don't understand the point of this, since the NOT revision_is_deleted should have already removed deleted files. (Also the page_id isn't necessarily null for deleted pages; after all the MediaWiki archive table has ar_page_id.)

 GROUP BY page_id, TO_DATE(page_creation_timestamp)
)
SELECT
 creation_date,
 COUNT(1) AS n_total, -- files uploaded
 SUM(IF(n_edits_total > 0, 1, 0)) AS n_edited, -- files that have had metadata edited
 SUM(IF(n_additions_total > 0, 1, 0)) AS n_added_to, -- files that have had metadata added
 SUM(IF(n_edits_2mo > 0, 1, 0)) AS n_edited_2mo, -- files that have had metadata edited in first 2 months
 SUM(IF(n_additions_2mo > 0, 1, 0)) AS n_added_to_2mo -- files that have had metadata added in first 2 months
 FROM summarized_revisions
GROUP BY creation_date;

TASK DETAIL
https://phabricator.wikimedia.org/T213597

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Neil_P._Quinn_WMF
Cc: Neil_P._Quinn_WMF, chelsyx, MNeisler, mpopov, kzimmerman, Ramsey-WMF, Abit, JKSTNK, Lahi, PDrouin-WMF, E1presidente, Cparle, Anooprao, SandraF_WMF, Tramullas, Acer, Silverfish, Susannaanas, Jane023, Wikidata-bugs, Base, matthiasmullie, Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Reply via email to