[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-20 Thread Addshore
Addshore added a comment.


  In T238878#5752474 , 
@Nuria wrote:
  
  > @Addshore @matthiasmullie Can we document on 
https://www.mediawiki.org/wiki/Wikibase/Schema/wbc_entity_usage what the "M" in 
wbc_entity_usage  stands for? 
  > See addshore select above:
  >
  >   select count(*) from wbc_entity_usage WHERE eu_entity_id LIKE 'M%';
  
  Our docs for this can be found at 
https://doc.wikimedia.org/Wikibase/master/php/topic_usagetracking.html

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Addshore
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, 
Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, 
Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, aude, 
Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-19 Thread Nuria
Nuria added a comment.


  > I'd like a clear explanation as the multiple threads have become 
complicated to resolve objectively.
  
  Indeed.
  @Milimetric we just want to have data for commons, for wikidata there are 
plenty metrics available that also measure usage from this table, see: 
https://grafana.wikimedia.org/d/00176/wikidata-entity-usage-project?orgId=1&fullscreen&panelId=17&from=now-5y&to=now

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, 
Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, 
Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, aude, 
Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-19 Thread Milimetric
Milimetric added a comment.


  @Nuria: what I thought we were doing is productionizing @matthiasmullie's 
approach from T239565#5728117 
.  This seemed sensible to 
me, and the change I submitted for T239565 
 properly comments that we are not 
measuring MediaInfo entities with the query on `wbc_entity_usage`.  If any of 
this is wrong or not what others were expecting, I'd like a clear explanation 
as the multiple threads have become complicated to resolve objectively.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Milimetric
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, 
Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, 
Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, aude, 
Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-19 Thread Ladsgroup
Ladsgroup added a comment.


  In T238878#5752474 , 
@Nuria wrote:
  
  > @Addshore @matthiasmullie Can we document on 
https://www.mediawiki.org/wiki/Wikibase/Schema/wbc_entity_usage what the "M" in 
wbc_entity_usage  stands for? 
  > See addshore select above:
  >
  >   select count(*) from wbc_entity_usage WHERE eu_entity_id LIKE 'M%';
  
  `M` is about the entities not the aspects, M stands for MediInfo which are 
basically items for SDC.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Ladsgroup
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, 
Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, 
Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, aude, 
Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-18 Thread Nuria
Nuria added a comment.


  @Addshore @matthiasmullie Can we document on 
https://www.mediawiki.org/wiki/Wikibase/Schema/wbc_entity_usage what the "M" in 
wbc_entity_usage  stands for?

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, 
Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, 
Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, aude, 
Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-12 Thread daniel
daniel added a comment.


  In T238878#5730630 , 
@Milimetric wrote:
  
  > In T238878#5708257 , 
@daniel wrote:
  >
  >> By the way, if you find rev_deleted != 0 for the current revision, it's a 
bug. The deletion flags for the current revisions will be ignored by the 
storage layer.
  >
  > I checked this as I was re-reading and there are 2500 such problems:
  
  Thanks for following up on this.
  
  I checked the code again, and I was (somewhat) wrong: the current revision 
can't have its //text// suppressed, but it can have the user and comment 
suppressed. The but map for DELETED_TEXT is 0x01, so the entry would be bad if 
rev_deleted is an odd number. I didn't spot any odd values when eyeballing the 
output you linked.
  
  Special:RevisionDelete prevents the DELETED_TEXT flag to be applied to the 
current revision. I didn't check whether that's safe against race conditions, 
and whether it can be circumvented with partial undeletion. But so far, there 
seems to be nothing bad in the database.
  
  I seem to recall that some code skips permission check for the current 
revision, but I can't find any such code right now to check whether it skips 
the check just for the text, or also for comments and user name. In any case, 
for the purpose of this task, rev_deleted can be ignored.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: daniel
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-10 Thread Milimetric
Milimetric added a comment.


  I translated T238878#5708511 
 to Hive to familiarize 
myself with it and get ahead of productionizing it.  I got similar numbers 
first as a sanity check and then grouped the numbers by the month of the 
page_latest revision's timestamp.  I was wondering if the numbers increase in 
some nice way that we could report on regardless of the overall total.  There's 
a fairly clear trend towards more structured data.  And this is not a clear way 
to show the trend because we're just looking at the latest revision not all 
revisions, but maybe it's useful to @Abit as she thinks about this metric:
  
  | month   | _c1|
  | 2019-01 | 50396  |
  | 2019-02 | 90102  |
  | 2019-03 | 108367 |
  | 2019-04 | 112658 |
  | 2019-05 | 140429 |
  | 2019-06 | 442834 |
  | 2019-07 | 142744 |
  | 2019-08 | 399805 |
  | 2019-09 | 757923 |
  | 2019-10 | 255584 |
  | 2019-11 | 531414 |
  |
  
  Also,
  
  In T238878#5708257 , 
@daniel wrote:
  
  > By the way, if you find rev_deleted != 0 for the current revision, it's a 
bug. The deletion flags for the current revisions will be ignored by the 
storage layer.
  
  I checked this as I was re-reading and there are 2500 such problems:
  
use wmf_raw;

 select p.wiki_db,
page_id,
page_title,
page_namespace,
rev_deleted

   from mediawiki_page p
inner join
mediawiki_revision ron r.wiki_db = p.wiki_db
and page_id = rev_page
and page_latest = rev_id
and rev_deleted <> 0

  where p.snapshot = '2019-11'
and r.snapshot = '2019-11'

  order by p.wiki_db
;
  
  F31468843: pages_with_rev_deletion_on_latest.tsv 

  
  (side-note: queries on these raw tables as imported from mediawiki are slow 
because these tables are not stored in parquet format)

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Milimetric
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-10 Thread Nuria
Nuria added a comment.


  I see 7.9 million wikidata items on that table and 1936 mediawinfo items.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-10 Thread Addshore
Addshore added a comment.


  In T238878#5728735 , 
@Nuria wrote:
  
  >> but by and large, it's currently wikidata entities being pulled in via Lua
  >
  > It is exclusively wikidata items at this time correct?
  
  Not based on my query above.
  There are 3401 flags showing usage of mediainfo entities.
  
mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> select count(DISTINCT 
eu_page_id) from wbc_entity_usage WHERE eu_entity_id LIKE 'M%';
++
| count(DISTINCT eu_page_id) |
++
|   1936 |
++
  
  Apparently 1936 pages on commmons using data from media info entities via LUA

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Addshore
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-10 Thread Nuria
Nuria added a comment.


  > but by and large, it's currently wikidata entities being pulled in via Lua
  
  It is exclusively wikidata items at this time correct?

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-10 Thread Addshore
Addshore added a comment.


  Indeed it will
  
mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> select count(*) from 
wbc_entity_usage WHERE eu_entity_id LIKE 'M%';
+--+
| count(*) |
+--+
| 3401 |
+--+
1 row in set (0.00 sec)

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Addshore
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-09 Thread Nuria
Nuria added a comment.


  Select for @mpopov to look at
  
select count(distinct eu_page_id)  from mediawiki_page as P JOIN 
mediawiki_wbc_entity_usage as W ON (W.eu_page_id = P.page_id and 
W.wiki_db=P.wiki_db and W.snapshot=P.snapshot)  where W.wiki_db="commonswiki" 
and W.snapshot="2019-11" and page_namespace = 6  and eu_aspect RLIKE '^L' and 
page_is_redirect=0;
  
  @mpopov

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-09 Thread Nuria
Nuria added a comment.


  After looking at this for a bit with @Ladsgroup and @mpopov  (cc @Abit ) 
  A commons page can have data from wikidata and the wikibase instance on 
commons (called WikibaseMediaInfo, MediaInfor for short).
  
  wbc_entity_usage will not have any data from structured data on commons that 
is kept on WikibaseMediaInfo. The data on that table only pertains to wikidata 
items used by "wiki clients". "wiki clients" are users of iwkidata's data , 
like eswiki.  So,  to find "how many file pages on commons contain at least 1 
structured data element" looking at slots is sufficient. That should be 
inclusive of all commons files. That number is ~3 million.
  
  Now, there are pages in commons that use, for example, wikidata info boxes, 
like: https://commons.wikimedia.org/wiki/Category:Lucas_Werkmeister
  The pages using wikidata items on commons can be tracked with 
wbc_entity_usage and that number is 7.9 million as of today. Now, to be clear, 
the 7.9 pertains entirely to a different set of infrastructure, namely 
wikidata's.  None of those 7.9 structure data items represent any info on the 
WikibaseMediaInfo instance.
  
  Ping @matthiasmullie to asses this info

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-03 Thread matthiasmullie
matthiasmullie added a comment.


  > If you want to count //pages//, then you are only looking at the current 
revision, right? And the current revision of a page is by definition never 
deleted/suppressed. So no need to check rev_deleted.
  
  So page_latest can never point to a revdel'ed revision? Ok cool, had no idea 
- might as well skip that condition (and the entire join with revision table) 
then if it's already implicitly so:
  
SELECT COUNT(DISTINCT page_id)
# page excludes deleted pages (which are in archive)
FROM page
# joining on page_latest - we only care about most recent
INNER JOIN slots ON slot_revision_id = page_latest
# mediainfo slot must contain actual content
INNER JOIN content ON slot_content_id = content_id AND content_size > 122
INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo';
  
  
  
+-+
| COUNT(DISTINCT page_id) |
+-+
| 3045691 |
+-+
1 row in set (7 min 23.17 sec)

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: matthiasmullie
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-03 Thread daniel
daniel added a comment.


  > I was thinking it made sense to exclude rev-deleted revisions: if a page 
only has (SDC) revisions that are partly restricted for most people, that we 
might not want to count those.
  
  If you want to count //pages//, then you are only looking at the current 
revision, right? And the current revision of a page is by definition never 
deleted/suppressed. So no need to check rev_deleted.
  
  By the way, if you find rev_deleted != 0 for the current revision, it's a 
bug. The deletion flags for the current revisions will be ignored by the 
storage layer.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: daniel
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-03 Thread matthiasmullie
matthiasmullie added a comment.


  @Milimetric it wasn't being used to omit deleted pages (see inline comments 
about archive table), but to **also** exclude those RevDel'ed revisions.
  I was thinking it made sense to exclude rev-deleted revisions: if a page only 
has (SDC) revisions that are partly restricted for most people, that we might 
not want to count those.
  But one could argue that we should include them, and I'm happy to exclude 
that condition from the query! (though the number is likely extremely low, too 
low to even matter)

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: matthiasmullie
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-12-02 Thread Milimetric
Milimetric added a comment.


  I noticed that three folks used `rev_deleted = 0` to mean "revision not 
deleted", but this field means something completely different 
(https://www.mediawiki.org/wiki/Manual:Revision_table#rev_deleted).  Basically, 
it's about parts of the revision that have been suppressed from different types 
of users in the interface (like hiding revision comments).  Good news is that 
revisions in the revision table are by definition not deleted.  When a page is 
deleted, its revisions all get inserted into the archive table and removed from 
the revision table.  So when you did `rev_deleted = 0` you were just filtering 
out revisions that had any kind of suppression applied.  I think that's a small 
number and didn't seem to add to the confusion above, but should be included 
just in case.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Milimetric
Cc: Milimetric, Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-28 Thread daniel
daniel added a comment.


  In T238878#5700408 , 
@Nuria wrote:
  
  > Can someone explain why the content table is as large as the revision 
table? I though these tables  were not used much other than commons but content 
table is big in dewiki for example
  
  Because //all// content is managed in the content table, including the main 
slot.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: daniel
Cc: Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-28 Thread Nuria
Nuria added a comment.


  Can someone explain why the content table is as large as the revision table? 
I though these tables  were not used much other than commons but content table 
is big in dewiki for example

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-27 Thread Nuria
Nuria added a comment.


  We will be scooping the slot tables to the cluster so these queries can run 
in parallel.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-27 Thread Cparle
Cparle added a comment.


  FWIW that query also looks sensible to me. Unless someone can point out 
something wrong with it I think this is probably accurate?

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Cparle
Cc: Cparle, nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, 
matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, GoranSMilovanovic, 
QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, 
rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, 
Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-26 Thread Addshore
Addshore added a comment.


  In T238878#5684895 , 
@Nuria wrote:
  
  > @Addshore : disclaimer: I know next to nothing about this but how are you 
taking into account that the revision is the last one for the page? That is, a 
page might have had a structured data item  in a prior revision and from its 
most current revision that structured data item is removed? In your select * i 
think* you are counting revisions with slots that contain data but if that 
revision is a past one for the page it should not be counted, as we only care 
about  structured data present in the page right now, correct? Hopefully this  
makes sense, I think you are discounting deleted "slots" but are counting 
"past" revisions for the page where those slots where "alive" as if they 
pertained to the count.
  
  That is indeed correct! hm...
  
  Regarding wbc_entity_usage, that is probably not the right way to go, as this 
will not include all files that currently have media info entities, unless 
those entities are also used in the wikitext slot via LUA or the property 
parser function?
  Correct me if I am wrong here @Ladsgroup
  
  In T238878#5691272 , 
@matthiasmullie wrote:
  
  > I believe that this query (based on @addshore's, but more strict about 
including only latest revision, of pages that have not been archived) is quite 
accurate (takes an awful long time to complete though)
  > Did I overlook anything here - any reason to believe this number is invalid?
  >
  >   SELECT COUNT(DISTINCT page_id)
  >   # page excludes deleted pages (which are in archive)
  >   FROM page
  >   # joining on page_latest - we only care about most recent (not 
revdeleted) revision
  >   INNER JOIN revision ON rev_id = page_latest AND rev_deleted = 0
  >   INNER JOIN slots ON slot_revision_id = rev_id
  >   # mediainfo slot must contain actual content
  >   INNER JOIN content ON slot_content_id = content_id AND content_size > 122
  >   INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 
'mediainfo';
  >
  >   +-+
  >   | COUNT(DISTINCT page_id) |
  >   +-+
  >   | 3004300 |
  >   +-+
  >   1 row in set (33 min 31.86 sec)
  
  The looks better to me!
  
  And wow, yes, it does take some time.
  I guess that is okay? If we don't need to monitor this too closely

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Addshore
Cc: nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, matthiasmullie, 
Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 4748kitoko, 
darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, 
Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, 
Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, 
Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-25 Thread matthiasmullie
matthiasmullie added a comment.


  I don't really know much about how `mediawiki_history` gets populated, but 
`wbcreateclaim` & `wbeditentity` (in addition to `wbsetclaim`) could also be 
used to create (or edit) statements IIRC, so this last set of results is likely 
incomplete.
  
  And I believe `wbc_entity_usage` currently *only* gets populated with 
M-entities fetched via Lua - any page where the M-entity is not used, will not 
show up there (I believe - could be wrong). So that data too (in additional to 
containing other, non-MediaInfo related entities) is likely incomplete.
  
  I believe that this query (based on @addshore's, but more strict about 
including only latest revision, of pages that have not been archived) is quite 
accurate (takes an awful long time to complete though)
  Did I overlook anything here - any reason to believe this number is invalid?
  
SELECT COUNT(DISTINCT page_id)
# page excludes deleted pages (which are in archive)
FROM page
# joining on page_latest - we only care about most recent (not revdeleted) 
revision
INNER JOIN revision ON rev_id = page_latest AND rev_deleted = 0
INNER JOIN slots ON slot_revision_id = rev_id
# mediainfo slot must contain actual content
INNER JOIN content ON slot_content_id = content_id AND content_size > 122
INNER JOIN slot_roles ON role_id = slot_role_id AND role_name = 'mediainfo';
  
+-+
| COUNT(DISTINCT page_id) |
+-+
| 3004300 |
+-+
1 row in set (33 min 31.86 sec)
  
  Just passed 3M files!

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: matthiasmullie
Cc: nettrom_WMF, Ladsgroup, daniel, Mayakp.wiki, gsingers, matthiasmullie, 
Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 4748kitoko, 
darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, 
Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, 
Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, 
Base, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-23 Thread Ladsgroup
Ladsgroup added a comment.


  So the 'wikibase_item' value for pageprops is for client wikis and not repos. 
You need to use property keys like `wb-claims` instead (I queried commons 
nothing showed up, meaning it doesn't add the value. It can be due to MCR but 
I'm not sure.). We don't have any keys for number of labels (or any sort of 
terms) in pageprops table because in wikidata it's accessible through term 
store (wb_terms/wbt_* tables) or WDQS.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Ladsgroup
Cc: Ladsgroup, daniel, Mayakp.wiki, gsingers, matthiasmullie, Addshore, 
kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 4748kitoko, darthmon_wmde, 
DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, E1presidente, 
Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, Acer, 
LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, Susannaanas, 
JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, aude, Ricordisamoa, 
Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Mbch331, 
jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-22 Thread mpopov
mpopov added a comment.


  Here are the missing screenshots:
  
  In T238878#5683048 , 
@Nuria wrote:
  
  > The work done by @mpopov  (if you are so kind @mpopov 
  > please upload your screenshots)
  > The wbc_entity_usage table is supposed to hold info on Wikidata usage for 
the pages For example, here's a random file I added some structured data to a 
few days ago: 
https://commons.wikimedia.org/wiki/File:P%C3%B3voa_de_Varzim_-i---i-_(25379025808).jpg
  > When you look for it the commonswiki replica, it has a page ID of 68860692. 
Looking for it in the wbc_entity_usage table we only see that it has a caption 
in English, which I added at basically the same time as several statements:
  
  F31133627: 1.png 
  
  The structured data is missing, despite being added //before// the caption.
  
  > eu_aspect column does have other values like "O" (statements) and "D" (not 
documented, but from a brief investigation looks like it's specifically for 
linking categories on Commons to Wikidata Q-items). There are some records of 
files with "O" aspects (as the MW page notes, it can refer to a variety to 
entity usages but typically it's statements) but then it gets weird because the 
language of the label isn't recorded and there's a bunch of seemingly 
unnecessary info? Take for example the MediaWiki DB data for 
https://commons.wikimedia.org/wiki/File:Jodrell_Bank_Mark_II_5.jpg
  
  F31133631: 2.png 
  
  F31133634: unnamed.png 
  
  > Woof! That's…not great. So, uh, clearly there's something funky going on 
with the Wikibase client extension? Or maybe that's data that was recorded by 
an earlier version of the extension before it knew to append language codes to 
labels? I don't know enough about the nitty-gritty there, so these are just 
vaguely educated guesses.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mpopov
Cc: matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Cparle, Anooprao, SandraF_WMF, 
GoranSMilovanovic, QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, 
_jensen, rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, 
terrrydactyl, Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, 
Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-22 Thread Nuria
Nuria added a comment.


  So, per my comment above, I think the number of items is actually smaller 
than the one @Addshore has computed but more wise folks can correct me if I am 
wrong.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Cparle, Anooprao, SandraF_WMF, 
GoranSMilovanovic, QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, 
_jensen, rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, 
terrrydactyl, Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, 
Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-22 Thread Nuria
Nuria added a comment.


  @Addshore : disclaimer: I know next to nothing about this but how are you 
taking into account that the revision is the last one for the page? That is, a 
page might have had a structured data item  in a prior revision and from its 
most current revision that structured data item is removed? In your select * i 
think* you are counting revisions with slots that contain data but if that 
revision is a past one for the page it should not be counted, as we only care 
about  structured data present in the page right now, correct? Hopefully this  
makes sense, I think you are discounting deleted "slots" but are counting 
"past" revisions for the page where those slots where "alive" as if they 
pertained to the count.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Cparle, Anooprao, SandraF_WMF, 
GoranSMilovanovic, QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, 
_jensen, rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, 
terrrydactyl, Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, 
Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-22 Thread matthiasmullie
matthiasmullie added a comment.


  As far as I can tell:
  @mpopov's queries (`wbc_entity_usage` based) include both MediaInfo items, 
along with Wikidata-items pulled in via Lua.
  @Addshore's queries are stricktly the amount of MediaInfo items.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: matthiasmullie
Cc: matthiasmullie, Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 
4748kitoko, darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Cparle, Anooprao, SandraF_WMF, 
GoranSMilovanovic, QZanden, Tramullas, Acer, LawExplorer, Salgo60, Silverfish, 
_jensen, rosalieper, Scott_WUaS, Susannaanas, JAllemandou, Jane023, 
terrrydactyl, Wikidata-bugs, Base, aude, Ricordisamoa, Wesalius, 
Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-22 Thread Addshore
Addshore added a comment.


  Last night I ran:
  
mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> SELECT COUNT(DISTINCT 
rev_page) FROM commonswiki.revision INNER JOIN commonswiki.slots ON 
revision.rev_id = slots.slot_revision_id WHERE slots.slot_ role_id = 2;
+--+
| COUNT(DISTINCT rev_page) |
+--+
|  2988233 |
+--+
1 row in set (1 min 51.29 sec)
  
  This morning I altered it slightly to try and take into account deleted pages.
  
mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> SELECT COUNT(DISTINCT 
rev_page)
-> FROM commonswiki.revision
-> INNER JOIN commonswiki.slots
-> ON revision.rev_id = slots.slot_revision_id
-> WHERE slots.slot_role_id = 2
-> AND revision.rev_deleted = 0;


+--+
| COUNT(DISTINCT rev_page) |
+--+
|  2990350 |
+--+
1 row in set (4 min 27.29 sec)
  
  I then also filtered out media info entities that have been emptied. (by 
looking up the size of one of these in the DB for a page that I created and 
removed structured data from)
  
mysql:research@dbstore1004.eqiad.wmnet [commonswiki]> SELECT COUNT(DISTINCT 
rev_page)
-> FROM commonswiki.revision
-> INNER JOIN commonswiki.slots
-> ON revision.rev_id = slots.slot_revision_id
-> INNER JOIN commonswiki.content
-> ON slots.slot_content_id = content.content_id
-> WHERE slots.slot_role_id = 2
-> AND revision.rev_deleted = 0
-> AND content.content_size > 122;

+--+
| COUNT(DISTINCT rev_page) |
+--+
|  2990420 |
+--+
1 row in set (4 min 55.86 sec)
  
  I don't think there will be a more reliable way to check this count.
  As long as I am interpreting "at least one structured data element" 
correctly, which I would read as having a caption or statement.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Addshore
Cc: Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 4748kitoko, 
darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, 
Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, 
Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, 
Base, matthiasmullie, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-21 Thread Abit
Abit added a comment.


  @Addshore said there were 2,988,205 as of 9 PST this morning.  I got the 
sense he didn't count using search, but he'll have to confirm.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Abit
Cc: Addshore, kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 4748kitoko, 
darthmon_wmde, DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, 
E1presidente, Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, 
Tramullas, Acer, LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, 
Scott_WUaS, Susannaanas, JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, 
Base, matthiasmullie, aude, Ricordisamoa, Wesalius, Lydia_Pintscher, 
Fabrice_Florin, Raymond, Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T238878: Data about how many file pages on Commons contain at least one structured data element

2019-11-21 Thread Nuria
Nuria added a comment.


  The work done by @mpopov
  
  The wbc_entity_usage table is supposed to hold info on Wikidata usage for the 
pages For example, here's a random file I added some structured data to a few 
days ago: 
https://commons.wikimedia.org/wiki/File:P%C3%B3voa_de_Varzim_-i---i-_(25379025808).jpg
  
  When you look for it the commonswiki replica, it has a page ID of 68860692. 
Looking for it in the wbc_entity_usage table we only see that it has a caption 
in English, which I added at basically the same time as several statements:
  
  Screen Shot 2019-11-19 at 5.58.57 PM.png
  eu_aspect column does have other values like "O" (statements) and "D" (not 
documented, but from a brief investigation looks like it's specifically for 
linking categories on Commons to Wikidata Q-items). There are some records of 
files with "O" aspects (as the MW page notes, it can refer to a variety to 
entity usages but typically it's statements) but then it gets weird because the 
language of the label isn't recorded and there's a bunch of seemingly 
unnecessary info? Take for example the MediaWiki DB data for 
https://commons.wikimedia.org/wiki/File:Jodrell_Bank_Mark_II_5.jpg
  
  Screen Shot 2019-11-19 at 6.36.04 PM.png
  Screen Shot 2019-11-19 at 6.36.33 PM.png
  Woof! That's…not great. So, uh, clearly there's something funky going on with 
the Wikibase client extension? Or maybe that's data that was recorded by an 
earlier version of the extension before it knew to append language codes to 
labels? I don't know enough about the nitty-gritty there, so these are just 
vaguely educated guesses.
  
  So, with the caveat that we're dealing with imperfect data and that some data 
is definitely missing, here are SOME lower bounds on files which have 
structured data:
  
  SELECT
  
COUNT(DISTINCT eu_page_id) AS n_files_with_statements,
COUNT(1) AS n_statements
  
  FROM wbc_entity_usage
  INNER JOIN page ON wbc_entity_usage.eu_page_id = page.page_id
  WHERE page_namespace = 6
  
AND eu_aspect = 'O';
  
  n_files_with_statements: 5 212 680
  n_statements: 8 677 888
  
  Files with labels:
  
  SELECT
  
COUNT(DISTINCT eu_page_id) AS n_files_with_captions
  
  FROM wbc_entity_usage
  INNER JOIN page ON wbc_entity_usage.eu_page_id = page.page_id
  WHERE page_namespace = 6
  
AND eu_aspect RLIKE '^L';
  
  n_files_with_captions: 7 857 520
  
  Files with captions and/or statements:
  
  SELECT
  
COUNT(DISTINCT eu_page_id) AS n_files_with_structured_data
  
  FROM wbc_entity_usage
  INNER JOIN page ON wbc_entity_usage.eu_page_id = page.page_id
  WHERE page_namespace = 6
  
AND (eu_aspect RLIKE '^L' OR eu_aspect = 'O');
  
  n_files_with_structured_data: 7 861 887
  
  Total files (at the moment I ran these queries, so we have a record of % w/ 
structured data):
  
  SELECT COUNT(1) AS n_files_total
  FROM page WHERE page_namespace = 6;
  
  n_files_total: 58 698 512
  
  Thanks for reading! Feel free to correct me if I made any mistakes with the 
queries or my thought process.

TASK DETAIL
  https://phabricator.wikimedia.org/T238878

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Nuria
Cc: kzimmerman, mpopov, Ramsey-WMF, Abit, Nuria, 4748kitoko, darthmon_wmde, 
DannyS712, Nandana, JKSTNK, Akovalyov, Lahi, PDrouin-WMF, Gq86, E1presidente, 
Cparle, Anooprao, SandraF_WMF, GoranSMilovanovic, QZanden, Tramullas, Acer, 
LawExplorer, Salgo60, Silverfish, _jensen, rosalieper, Scott_WUaS, Susannaanas, 
JAllemandou, Jane023, terrrydactyl, Wikidata-bugs, Base, matthiasmullie, aude, 
Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, 
Steinsplitter, Mbch331, jeremyb
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs