dcausse added a comment.
In T342111#9054619 <https://phabricator.wikimedia.org/T342111#9054619>, @Manuel wrote:

> Hi @dcausse, thank you so much, this is very helpful! \o/
>
>> I believe that at first we are interested in knowing the number of triples that would be moved out
>
> The "number of triples that would be moved out" seems to be the primary metric of interest for the Blazegraph split. But after your explanation of the table, I now realize that this metric is not equal to the number of rows in that table that are required to represent these triples, correct? So could you quickly confirm that the "number of triples that would be moved out" (distinct triples) is actually the preferable metric for our purposes (and not e.g. the "number of rows that would be moved out")?

The table `wikibase_rdf` has one row per triple, plus an additional column named `context` that we use to annotate the entity the triple was extracted from while reading the dump. The exceptions are shared values and references, whose `context` column is set to `<http://wikiba.se/ontology#Value>` and `<http://wikiba.se/ontology#Reference>` respectively.

It is true that duplicates are in there, and a `select count(*) from wikibase_rdf` will give a number greater than the number of triples stored in Blazegraph. To my knowledge such duplicates are:

- shared values and references: these are very likely to have a high number of duplicates, even if Wikibase tries to deduplicate some on the fly while extracting the RDF dump
- some metadata regarding sitelinks: a triple such as `<https://be-tarask.wikipedia.org/> wikibase:wikiGroup "wikipedia" .` is likely to be duplicated in the `wikibase_rdf` table. I forgot to mention them in the notebook.

And to answer your question: you are correct, the number of distinct triples is what matters to us, so to get an accurate number you might have to `distinct(subject, predicate, object)` at some point, thanks!
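To make the difference between the row count and the distinct-triple count concrete, here is a minimal sketch on a toy table. It uses an in-memory SQLite database rather than the real `wikibase_rdf` table. The column layout (`subject`, `predicate`, `object`, `context`), the sample rows, and the helper names `build_toy_table`, `count_rows` and `count_distinct_triples` are all assumptions made for illustration.

```python
import sqlite3

# Toy stand-in for wikibase_rdf. The column layout and every row below are
# made up for illustration; they are not taken from the real dump.
VALUE_CONTEXT = "<http://wikiba.se/ontology#Value>"
TOY_ROWS = [
    # (subject, predicate, object, context)
    ("wd:Q1", "rdfs:label", '"universe"@en', "wd:Q1"),
    ("wd:Q2", "rdfs:label", '"Earth"@en', "wd:Q2"),
    # Sitelink metadata repeated once per entity that links to this wiki.
    ("<https://be-tarask.wikipedia.org/>", "wikibase:wikiGroup", '"wikipedia"', "wd:Q1"),
    ("<https://be-tarask.wikipedia.org/>", "wikibase:wikiGroup", '"wikipedia"', "wd:Q2"),
    # A shared value emitted twice, both times under the Value context.
    ("wdv:abc", "wikibase:quantityAmount", '"+1"', VALUE_CONTEXT),
    ("wdv:abc", "wikibase:quantityAmount", '"+1"', VALUE_CONTEXT),
]


def build_toy_table():
    """Create an in-memory table shaped like the assumed wikibase_rdf layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wikibase_rdf "
        "(subject TEXT, predicate TEXT, object TEXT, context TEXT)"
    )
    conn.executemany("INSERT INTO wikibase_rdf VALUES (?, ?, ?, ?)", TOY_ROWS)
    return conn


def count_rows(conn):
    """Number of rows, duplicates included."""
    return conn.execute("SELECT COUNT(*) FROM wikibase_rdf").fetchone()[0]


def count_distinct_triples(conn):
    """Number of distinct (subject, predicate, object), ignoring context."""
    query = """
        SELECT COUNT(*) FROM (
            SELECT DISTINCT subject, predicate, object FROM wikibase_rdf
        )
    """
    return conn.execute(query).fetchone()[0]


if __name__ == "__main__":
    conn = build_toy_table()
    print("rows:", count_rows(conn))                          # rows: 6
    print("distinct triples:", count_distinct_triples(conn))  # distinct triples: 4
```

The `context` column is left out of the `DISTINCT` projection on purpose: the same sitelink triple recorded under two different entities should count once.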
TASK DETAIL
https://phabricator.wikimedia.org/T342111