Nuria added a comment.
The work done by @mpopov The wbc_entity_usage table is supposed to hold info on Wikidata usage for the pages For example, here's a random file I added some structured data to a few days ago: https://commons.wikimedia.org/wiki/File:P%C3%B3voa_de_Varzim_-i---i-_(25379025808).jpg When you look for it the commonswiki replica, it has a page ID of 68860692. Looking for it in the wbc_entity_usage table we only see that it has a caption in English, which I added at basically the same time as several statements: Screen Shot 2019-11-19 at 5.58.57 PM.png eu_aspect column does have other values like "O" (statements) and "D" (not documented, but from a brief investigation looks like it's specifically for linking categories on Commons to Wikidata Q-items). There are some records of files with "O" aspects (as the MW page notes, it can refer to a variety to entity usages but typically it's statements) but then it gets weird because the language of the label isn't recorded and there's a bunch of seemingly unnecessary info? Take for example the MediaWiki DB data for https://commons.wikimedia.org/wiki/File:Jodrell_Bank_Mark_II_5.jpg Screen Shot 2019-11-19 at 6.36.04 PM.png Screen Shot 2019-11-19 at 6.36.33 PM.png Woof! That's…not great. So, uh, clearly there's something funky going on with the Wikibase client extension? Or maybe that's data that was recorded by an earlier version of the extension before it knew to append language codes to labels? I don't know enough about the nitty-gritty there, so these are just vaguely educated guesses. So, with the caveat that we're dealing with imperfect data and that some data is definitely missing, here are SOME lower bounds on files which have structured data: SELECT COUNT(DISTINCT eu_page_id) AS n_files_with_statements, COUNT(1) AS n_statements FROM wbc_entity_usage INNER JOIN page ON wbc_entity_usage.eu_page_id = page.page_id WHERE page_namespace = 6 AND eu_aspect = 'O'; n_files_with_statements: 5 212 680 n_statements: 8 677 888 Files with labels: SELECT COUNT(DISTINCT eu_page_id) AS n_files_with_captions FROM wbc_entity_usage INNER JOIN page ON wbc_entity_usage.eu_page_id = page.page_id WHERE page_namespace = 6 AND eu_aspect RLIKE '^L'; n_files_with_captions: 7 857 520 Files with captions and/or statements: SELECT COUNT(DISTINCT eu_page_id) AS n_files_with_structured_data FROM wbc_entity_usage INNER JOIN page ON wbc_entity_usage.eu_page_id = page.page_id WHERE page_namespace = 6 AND (eu_aspect RLIKE '^L' OR eu_aspect = 'O'); n_files_with_structured_data: 7 861 887 Total files (at the moment I ran these queries, so we have a record of % w/ structured data): SELECT COUNT(1) AS n_files_total FROM page WHERE page_namespace = 6; n_files_total: 58 698 512 Thanks for reading! Feel free to correct me if I made any mistakes with the queries or my thought process. TASK DETAIL https://phabricator.wikimedia.org/T238878 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Nuria Cc: kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, E1presidente, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, matthiasmullie, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
_______________________________________________ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs