dcausse added a comment.

  In T342111#9054619 <https://phabricator.wikimedia.org/T342111#9054619>, 
@Manuel wrote:
  
  > Hi @dcausse, thank you so much, this is very helpful! \o/
  >
  >> I believe that at first we are interested in knowing the number of triples 
that would be moved out
  >
  > The "number of triples that would be moved out" seems to be the primary 
metric of interest for the Blazegraph split. But after your explanation of the 
table, I now realize that this metric is not equal to the number of rows in 
that table that are required to represent these triples, correct? So could you 
quickly confirm that the "number of triples that would be moved out" (distinct 
triples) is actually the preferable metric for our purposes (and not e.g. the 
"number of rows that would be moved out")?
  
  The table `wikibase_rdf` does have one row per triple, plus an additional 
column named `context` that we use to annotate the entity the triple was 
extracted from while reading the dump.
  The caveat is that shared values and references have 
`<http://wikiba.se/ontology#Value>` and `<http://wikiba.se/ontology#Reference>`, 
respectively, set as their `context`.
  It is true that duplicates are in there, so a `select count(*) from 
wikibase_rdf` will give a number greater than the number of triples stored in 
Blazegraph.
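  To illustrate the role of the `context` column, here is a minimal sketch 
that uses an in-memory SQLite table as a stand-in for `wikibase_rdf`. The 
column layout, the entity `context` format and all rows are assumptions made 
up for illustration; only the two `context` URIs for shared values and 
references come from the description above.

```python
import sqlite3

# The two special context markers mentioned above.
VALUE_CTX = "<http://wikiba.se/ontology#Value>"
REFERENCE_CTX = "<http://wikiba.se/ontology#Reference>"

# Toy stand-in for `wikibase_rdf`; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wikibase_rdf (context TEXT, subject TEXT, predicate TEXT, object TEXT)"
)
conn.executemany(
    "INSERT INTO wikibase_rdf VALUES (?, ?, ?, ?)",
    [
        # Triples annotated with the entity they were extracted from (assumed format).
        ("<http://www.wikidata.org/entity/Q1>", "wd:Q1", "p:P1", "wds:Q1-s1"),
        ("<http://www.wikidata.org/entity/Q1>", "wds:Q1-s1", "prov:wasDerivedFrom", "wdref:r1"),
        # A shared value triple.
        (VALUE_CTX, "wdv:v1", "wikibase:quantityAmount", '"+1"'),
        # The same shared reference triple stored twice (a duplicate row).
        (REFERENCE_CTX, "wdref:r1", "pr:P2", "wd:Q3"),
        (REFERENCE_CTX, "wdref:r1", "pr:P2", "wd:Q3"),
    ],
)

# Count rows per kind of context: entity, shared value, shared reference.
breakdown = dict(
    conn.execute(
        """
        SELECT CASE context
                 WHEN ? THEN 'shared value'
                 WHEN ? THEN 'shared reference'
                 ELSE 'entity'
               END AS kind,
               count(*)
        FROM wikibase_rdf
        GROUP BY kind
        """,
        (VALUE_CTX, REFERENCE_CTX),
    ).fetchall()
)
print(breakdown)
```

  On the real table the same `CASE` on `context` should work in Spark SQL or 
Hive as well, but I have not run it there.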
  
  To my knowledge such duplicates are:
  
  - shared values and references: these are very likely to have a high number 
of duplicates, even though Wikibase tries to deduplicate some on the fly while 
extracting the RDF dump
  - some metadata regarding sitelinks: a triple such as 
`<https://be-tarask.wikipedia.org/> wikibase:wikiGroup "wikipedia" .` is 
likely to be duplicated in the `wikibase_rdf` table. I forgot to mention these 
in the notebook.
  
  And to answer your question: you are correct, the number of distinct triples 
is what matters to us, so to get an accurate number you might have to apply 
`distinct(subject, predicate, object)` at some point. Thanks!
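
  As a rough sketch of the difference between the two metrics, the snippet 
below compares the row count with the distinct triple count on a toy in-memory 
SQLite table. The schema, the entity `context` format and all rows are 
hypothetical; the sitelink triple is the one quoted above.

```python
import sqlite3

# Toy stand-in for `wikibase_rdf`; schema and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wikibase_rdf (context TEXT, subject TEXT, predicate TEXT, object TEXT)"
)
conn.executemany(
    "INSERT INTO wikibase_rdf VALUES (?, ?, ?, ?)",
    [
        # A triple that occurs only once.
        ("<http://www.wikidata.org/entity/Q1>", "wd:Q1", "rdfs:label", '"universe"@en'),
        # The same sitelink metadata triple emitted for two different entities.
        ("<http://www.wikidata.org/entity/Q1>", "<https://be-tarask.wikipedia.org/>",
         "wikibase:wikiGroup", '"wikipedia"'),
        ("<http://www.wikidata.org/entity/Q2>", "<https://be-tarask.wikipedia.org/>",
         "wikibase:wikiGroup", '"wikipedia"'),
        # The same shared value triple stored twice.
        ("<http://wikiba.se/ontology#Value>", "wdv:v1", "wikibase:quantityAmount", '"+1"'),
        ("<http://wikiba.se/ontology#Value>", "wdv:v1", "wikibase:quantityAmount", '"+1"'),
    ],
)

# Number of rows, duplicates included.
row_count = conn.execute("SELECT count(*) FROM wikibase_rdf").fetchone()[0]

# Number of distinct (subject, predicate, object) triples, ignoring `context`.
triple_count = conn.execute(
    "SELECT count(*) FROM "
    "(SELECT DISTINCT subject, predicate, object FROM wikibase_rdf)"
).fetchone()[0]

print(row_count, triple_count)
```

  Here the five rows collapse to three distinct triples, which is the number 
that would correspond to what Blazegraph stores.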

TASK DETAIL
  https://phabricator.wikimedia.org/T342111
