mpopov added a comment.

Okay, here are the numbers, calculated under the following conditions:

  • Using the December 2018 snapshot of MediaWiki History in the Data Lake
  • Only files which have not been deleted are counted
  • Only revisions to the metadata which were not reverted, were not themselves reverts, and were not deleted are counted
  • "Metadata augmented w/in 1st 2mo" means there was at least 1 byte-adding revision (other than the initial upload) to the file's page within the first 60 days after creation
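
For a single file, the last condition can be sketched roughly as below. This is an illustrative Python sketch, not the query that produced the numbers: the `Revision` type and the `metadata_augmented` helper are hypothetical names, and the day difference is taken on calendar dates on the assumption that this mirrors how `DATEDIFF` behaves in the query in the appendix.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Revision:
    timestamp: datetime
    parent_id: int   # 0 for the initial upload revision
    bytes_diff: int  # change in page text size, in bytes


def metadata_augmented(upload, revisions, window_days=60):
    """True if any later, byte-adding revision falls within the window."""
    return any(
        r.parent_id > 0       # skip the upload revision itself
        and r.bytes_diff > 0  # byte-adding edits only
        and (r.timestamp.date() - upload.date()).days <= window_days
        for r in revisions
    )


# Made-up example: one upload, one edit inside the window, one outside it.
upload = datetime(2018, 1, 1, 12, 0)
revs = [
    Revision(upload, 0, 500),
    Revision(datetime(2018, 2, 15, 9, 0), 101, 42),
    Revision(datetime(2018, 6, 1, 9, 0), 102, 10),
]
print(metadata_augmented(upload, revs))
```

Reverted, reverting, and deleted revisions are assumed to have been filtered out before this check, as in the other conditions.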

It looks like the baseline for the percentage of files which have metadata added within the first 2 months is 45.6% overall:

Files since 2003    Metadata augmented w/in 1st 2mo (60d)    Proportion
52,640,746          24,003,593                               45.599%
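
As a quick arithmetic check, the overall proportion can be recomputed from the two counts reported above. This is only a sketch; the variable names are made up for illustration.

```python
files_total = 52_640_746      # files since 2003 (not deleted)
files_augmented = 24_003_593  # metadata augmented within the first 60 days

proportion = files_augmented / files_total
print(f"{proportion:.3%}")  # three decimals, as in the table
print(f"{proportion:.1%}")  # rounded baseline figure
```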

Here are the final numbers, broken down by year and, for 2018, by month:

Year    Files uploaded that year    Metadata augmented w/in 1st 2mo (60d)    Proportion
2004    17,669                      9,423                                    53.331%
2005    265,976                     108,449                                  40.774%
2006    648,025                     228,230                                  35.219%
2007    1,205,884                   371,729                                  30.826%
2008    1,403,480                   576,987                                  41.111%
2009    1,927,836                   822,061                                  42.642%
2010    2,333,372                   863,588                                  37.010%
2011    3,884,635                   1,287,972                                33.156%
2012    3,490,905                   1,589,173                                45.523%
2013    4,591,272                   2,007,547                                43.725%
2014    4,715,323                   2,215,437                                46.984%
2015    5,683,966                   2,990,535                                52.614%
2016    6,312,067                   2,921,214                                46.280%
2017    8,182,236                   3,623,897                                44.290%
2018    7,978,099                   4,387,351                                54.992%

Month             Files uploaded that month    Metadata augmented w/in 1st 2mo (60d)    Proportion
January 2018      652,863                      322,246                                  49.359%
February 2018     705,945                      399,709                                  56.620%
March 2018        784,484                      358,703                                  45.725%
April 2018        609,520                      276,230                                  45.319%
May 2018          714,875                      414,765                                  58.019%
June 2018         588,235                      363,863                                  61.857%
July 2018         650,022                      409,261                                  62.961%
August 2018       783,718                      515,037                                  65.717%
September 2018    817,719                      436,632                                  53.396%
October 2018      563,806                      296,135                                  52.524%
November 2018     573,655                      363,017                                  63.281%
December 2018     533,257                      231,753                                  43.460%
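
The monthly figures should add up to the 2018 yearly row. A small consistency check, with the counts copied from the monthly breakdown above (the `monthly_2018` name is just for illustration):

```python
# (files uploaded, files with metadata augmented within 60 days) per month of 2018
monthly_2018 = {
    "January": (652_863, 322_246),
    "February": (705_945, 399_709),
    "March": (784_484, 358_703),
    "April": (609_520, 276_230),
    "May": (714_875, 414_765),
    "June": (588_235, 363_863),
    "July": (650_022, 409_261),
    "August": (783_718, 515_037),
    "September": (817_719, 436_632),
    "October": (563_806, 296_135),
    "November": (573_655, 363_017),
    "December": (533_257, 231_753),
}

uploaded = sum(u for u, _ in monthly_2018.values())
augmented = sum(a for _, a in monthly_2018.values())
print(uploaded, augmented, f"{augmented / uploaded:.3%}")
```

The totals come out to 7,978,099 uploads and 4,387,351 augmented files, matching the 2018 yearly figures.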

Appendix

USE wmf;
WITH page_creation_timestamps AS (
  -- since page_creation_timestamp in mediawiki_history table is wrong:
  SELECT
    page_id,
    event_timestamp AS upload_timestamp
  FROM mediawiki_history
  WHERE snapshot = '${snapshot}'
    AND wiki_db = 'commonswiki'
    AND event_entity = 'revision'
    AND page_namespace = 6
    AND revision_parent_id = 0
    AND NOT revision_is_identity_revert -- don't count edits that are reverts
    AND NOT revision_is_identity_reverted -- don't count edits that were reverted
    AND NOT revision_is_deleted -- don't count edits moved to archive table
    AND page_id IS NOT NULL -- don't count deleted files
), fixed_revision_history AS (
  SELECT
    page_creation_timestamps.page_id AS page_id,
    upload_timestamp,
    event_timestamp AS revision_timestamp,
    revision_parent_id,
    revision_text_bytes_diff
  FROM page_creation_timestamps
  LEFT JOIN mediawiki_history ON (
    page_creation_timestamps.page_id = mediawiki_history.page_id
    AND mediawiki_history.snapshot = '${snapshot}'
    AND mediawiki_history.wiki_db = 'commonswiki'
    AND NOT mediawiki_history.revision_is_identity_revert -- don't count edits that are reverts
    AND NOT mediawiki_history.revision_is_identity_reverted -- don't count edits that were reverted
    AND NOT mediawiki_history.revision_is_deleted -- don't count edits moved to archive table
  )
), summarized_revisions AS (
  SELECT
    page_id,
    TO_DATE(upload_timestamp) AS creation_date,
    COUNT(1) AS n_edits,
    SUM(IF(revision_parent_id > 0, 1, 0)) AS n_later_edits,
    SUM(IF(
      revision_parent_id > 0 -- exclude the upload revision itself
      AND revision_text_bytes_diff > 0 -- byte-adding edits only
      AND DATEDIFF(revision_timestamp, upload_timestamp) <= 60, -- within first 60 days
      1, 0
    )) AS n_additions_2mo
  FROM fixed_revision_history
  GROUP BY page_id, TO_DATE(upload_timestamp)
)
SELECT
  creation_date,
  COUNT(1) AS n_uploaded, -- files uploaded
  SUM(IF(n_later_edits > 0, 1, 0)) AS n_later_edited, -- files whose pages were edited after upload
  SUM(IF(n_additions_2mo > 0, 1, 0)) AS n_added_to_2mo -- files that have had metadata added after creation and in first 2 months
FROM summarized_revisions
GROUP BY creation_date;
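
The query returns one row per creation date, so the yearly and monthly figures need one more roll-up step. A minimal Python sketch of that step, assuming the result rows are available as `(creation_date, n_uploaded, n_added_to_2mo)` tuples with `creation_date` formatted as `YYYY-MM-DD`; the `rollup_by_year` helper and the sample rows are hypothetical, not the code actually used:

```python
from collections import defaultdict


def rollup_by_year(rows):
    """Sum daily counts per upload year and compute the proportion."""
    totals = defaultdict(lambda: [0, 0])
    for creation_date, n_uploaded, n_added in rows:
        year = creation_date[:4]  # 'YYYY-MM-DD' -> 'YYYY'
        totals[year][0] += n_uploaded
        totals[year][1] += n_added
    return {
        year: (uploaded, added, added / uploaded)
        for year, (uploaded, added) in sorted(totals.items())
    }


# Made-up daily rows, only to show the shape of the input.
rows = [
    ("2017-12-31", 100, 40),
    ("2018-01-01", 200, 90),
    ("2018-01-02", 300, 160),
]
for year, (uploaded, added, share) in rollup_by_year(rows).items():
    print(year, uploaded, added, f"{share:.3%}")
```

Slicing `creation_date[:7]` instead would give the monthly breakdown.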

TASK DETAIL
https://phabricator.wikimedia.org/T213597
