Addshore added a comment.

  In T238878#5684895 <https://phabricator.wikimedia.org/T238878#5684895>, 
@Nuria wrote:
  
  > @Addshore: disclaimer: I know next to nothing about this, but how are you
  > taking into account that the revision is the last one for the page? That is,
  > a page might have had a structured data item in a prior revision, and in its
  > most current revision that structured data item has been removed. In your
  > select, I think you are counting revisions with slots that contain data, but
  > if that revision is a past one for the page it should not be counted, as we
  > only care about structured data present in the page right now, correct?
  > Hopefully this makes sense. I think you are discounting deleted "slots" but
  > are counting "past" revisions for the page where those slots were "alive" as
  > if they pertained to the count.
  
  That is indeed correct! hmmmmm...
  
  Regarding wbc_entity_usage, that is probably not the right way to go, as it 
will not include all files that currently have MediaInfo entities, unless 
those entities are also used in the wikitext slot via Lua or the property 
parser function.
  Correct me if I am wrong here, @Ladsgroup.
  
  In T238878#5691272 <https://phabricator.wikimedia.org/T238878#5691272>, 
@matthiasmullie wrote:
  
  > I believe that this query (based on @addshore's, but stricter about
  > including only the latest revision of pages that have not been archived) is
  > quite accurate (it takes an awfully long time to complete, though).
  > Did I overlook anything here? Is there any reason to believe this number is invalid?
  >
  >   SELECT COUNT(DISTINCT page_id)
  >   # page excludes deleted pages (which are in archive)
  >   FROM page
  >   # joining on page_latest - we only care about most recent (not revdeleted) revision
  >   INNER JOIN revision ON rev_id = page_latest AND rev_deleted = 0
  >   INNER JOIN slots ON slot_revision_id = rev_id
  >   # mediainfo slot must contain actual content
  >   INNER JOIN content ON slot_content_id = content_id AND content_size > 122
  >   INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo';
  >
  >   +-------------------------+
  >   | COUNT(DISTINCT page_id) |
  >   +-------------------------+
  >   |                 3004300 |
  >   +-------------------------+
  >   1 row in set (33 min 31.86 sec)
  
  That looks better to me!
  
  And wow, yes, it does take some time.
  I guess that is okay, as long as we don't need to monitor this too closely.
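
  For anyone who wants to see the effect of joining on page_latest without waiting half an hour, here is a minimal sketch against an in-memory SQLite database. It is only an illustration under stated assumptions: the tables are cut down to the columns the query touches, the rows are made up, ANY_REVISION is a hypothetical "count any revision" variant (not necessarily the earlier query from this task), and treating content_size <= 122 as "no actual content" simply mirrors the threshold in the quoted query.

```python
import sqlite3

# Toy schema: column names follow the quoted query, but the tables are
# heavily simplified and every row below is invented for illustration.
SCHEMA = """
CREATE TABLE page (page_id INTEGER, page_latest INTEGER);
CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER, rev_deleted INTEGER);
CREATE TABLE slots (slot_revision_id INTEGER, slot_role_id INTEGER, slot_content_id INTEGER);
CREATE TABLE content (content_id INTEGER, content_size INTEGER);
CREATE TABLE slot_roles (role_id INTEGER, role_name TEXT);
"""

ROWS = """
INSERT INTO slot_roles VALUES (1, 'main'), (2, 'mediainfo');
-- Page 1: both revisions (10, 11) carry a non-empty mediainfo slot.
-- Page 2: old revision 20 had data; latest revision 21 holds a slot of
-- size 122, which the threshold treats as empty (an assumption).
INSERT INTO page VALUES (1, 11), (2, 21);
INSERT INTO revision VALUES (10, 1, 0), (11, 1, 0), (20, 2, 0), (21, 2, 0);
INSERT INTO content VALUES (100, 500), (101, 600), (200, 400), (201, 122);
INSERT INTO slots VALUES (10, 2, 100), (11, 2, 101), (20, 2, 200), (21, 2, 201);
"""

# Hypothetical variant: counts a page if ANY non-deleted revision has data.
ANY_REVISION = """
SELECT COUNT(DISTINCT rev_page)
FROM revision
INNER JOIN slots ON slot_revision_id = rev_id
INNER JOIN content ON slot_content_id = content_id AND content_size > 122
INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo'
WHERE rev_deleted = 0
"""

# Same joins as the quoted query: only the latest revision of each page counts.
LATEST_ONLY = """
SELECT COUNT(DISTINCT page_id)
FROM page
INNER JOIN revision ON rev_id = page_latest AND rev_deleted = 0
INNER JOIN slots ON slot_revision_id = rev_id
INNER JOIN content ON slot_content_id = content_id AND content_size > 122
INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo'
"""


def build_toy_db():
    """Create the in-memory toy database and load the made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA + ROWS)
    return conn


def count_pages(conn, query):
    """Run a single-value COUNT query and return the number."""
    return conn.execute(query).fetchone()[0]


if __name__ == "__main__":
    conn = build_toy_db()
    print("any revision:", count_pages(conn, ANY_REVISION))
    print("latest only: ", count_pages(conn, LATEST_ONLY))
```

  On this toy data the "any revision" variant counts both pages, while the page_latest version counts only page 1, which is the over-counting @Nuria described.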

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

To: Addshore
Cc: nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, matthiasmullie, 
Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 4748kitoko, 
darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, 
Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, 
Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, 
Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Mbch331, jeremyb