Addshore added a comment.

In T238878#5684895 <https://phabricator.wikimedia.org/T238878#5684895>, @Nuria wrote:

> @Addshore: disclaimer: I know next to nothing about this, but how are you taking into account that the revision is the last one for the page? That is, a page might have had a structured data item in a prior revision, and in its most current revision that structured data item has been removed. In your select I *think* you are counting revisions with slots that contain data, but if that revision is a past one for the page it should not be counted, as we only care about structured data present in the page right now, correct? Hopefully this makes sense. I think you are discounting deleted "slots" but are counting "past" revisions for the page where those slots were "alive" as if they pertained to the count.

That is indeed correct! Hmmmmm...

Regarding wbc_entity_usage, that is probably not the right way to go, as it will not include all files that currently have media info entities, unless those entities are also used in the wikitext slot via Lua or the property parser function. Correct me if I am wrong here, @Ladsgroup.

In T238878#5691272 <https://phabricator.wikimedia.org/T238878#5691272>, @matthiasmullie wrote:

> I believe that this query (based on @addshore's, but stricter about including only the latest revision of pages that have not been archived) is quite accurate (it takes an awfully long time to complete, though).
> Did I overlook anything here - any reason to believe this number is invalid?
>
> SELECT COUNT(DISTINCT page_id)
> # page excludes deleted pages (which are in archive)
> FROM page
> # joining on page_latest - we only care about the most recent (not revdeleted) revision
> INNER JOIN revision ON rev_id = page_latest AND rev_deleted = 0
> INNER JOIN slots ON slot_revision_id = rev_id
> # mediainfo slot must contain actual content
> INNER JOIN content ON slot_content_id = content_id AND content_size > 122
> INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo';
>
> +-------------------------+
> | COUNT(DISTINCT page_id) |
> +-------------------------+
> |                 3004300 |
> +-------------------------+
> 1 row in set (33 min 31.86 sec)

That looks better to me! And wow, yes, it does take some time. I guess that is okay, as long as we don't need to monitor this too closely.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878
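
To make the point about past versus latest revisions concrete, here is a minimal sketch on a toy SQLite database. The table and column names are copied from the quoted query, but the schema is heavily simplified, the sample rows are invented, and the "any revision" query is only a hypothetical stand-in for a count that forgets the page_latest join (it is not the earlier query from this task). The 122 threshold is taken from the quoted query as-is.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER);
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER,
                               rev_deleted INTEGER);
        CREATE TABLE slot_roles (role_id INTEGER PRIMARY KEY, role_name TEXT);
        CREATE TABLE content (content_id INTEGER PRIMARY KEY, content_size INTEGER);
        CREATE TABLE slots (slot_revision_id INTEGER, slot_role_id INTEGER,
                            slot_content_id INTEGER);

        INSERT INTO slot_roles VALUES (1, 'main'), (2, 'mediainfo');

        -- Page 1: old revision 10 had mediainfo data, latest revision 11 has none.
        -- Page 2: latest revision 20 has mediainfo data.
        -- Page 3: latest revision 30 has only a main slot.
        INSERT INTO page VALUES (1, 11), (2, 20), (3, 30);
        INSERT INTO revision VALUES (10, 1, 0), (11, 1, 0), (20, 2, 0), (30, 3, 0);
        INSERT INTO content VALUES (100, 900), (101, 500), (102, 900), (103, 122),
                                   (104, 900), (105, 300), (106, 900);
        INSERT INTO slots VALUES (10, 1, 100), (10, 2, 101), (11, 1, 102),
                                 (11, 2, 103), (20, 1, 104), (20, 2, 105),
                                 (30, 1, 106);
    """)
    return conn


def count_any_revision(conn):
    """Hypothetical loose count: any revision with a non-empty mediainfo slot."""
    return conn.execute("""
        SELECT COUNT(DISTINCT rev_page)
        FROM revision
        INNER JOIN slots ON slot_revision_id = rev_id
        INNER JOIN content ON slot_content_id = content_id AND content_size > 122
        INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo'
    """).fetchone()[0]


def count_latest_revision(conn):
    """Strict count, following the quoted query: latest revision only."""
    return conn.execute("""
        SELECT COUNT(DISTINCT page_id)
        FROM page
        INNER JOIN revision ON rev_id = page_latest AND rev_deleted = 0
        INNER JOIN slots ON slot_revision_id = rev_id
        INNER JOIN content ON slot_content_id = content_id AND content_size > 122
        INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo'
    """).fetchone()[0]


if __name__ == "__main__":
    conn = build_toy_db()
    print("any revision:   ", count_any_revision(conn))
    print("latest revision:", count_latest_revision(conn))
```

On this toy data the loose count includes page 1 because of its old revision, while the strict count only includes page 2, which is the difference raised in the first quoted comment.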