Addshore added a subscriber: Smalyshev.
Addshore added a comment.

> An opaque k/v store won't allow anything but discrete lookups by entity ID; how are violations queried? In other words, this seems to be only a small part of the larger model. What does that look like, and why are we creating this separation (i.e. what problem does this solve)?

These constraint violations will be loaded into the Wikidata Query Service (WDQS) and will be queryable via SPARQL there.
The constraint results are also needed in MediaWiki so they can be exposed to users in the UI and via the API.
We may also need to provide dumps of all constraint violations to ease loading the data into WDQS servers that start from scratch, but @Smalyshev would have to chime in on that.
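
To make the read side concrete, here is a minimal sketch of how the violations could be looked up once they are in WDQS and via the MediaWiki API. The `wikibase:hasViolationForConstraint` predicate and the `wbcheckconstraints` API module are assumptions about how the results would be exposed, not settled parts of this plan.

```lang=python
# Minimal sketch of the two read paths mentioned above. Both the
# wikibase:hasViolationForConstraint predicate and the wbcheckconstraints
# API module are assumptions used for illustration.
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"
HEADERS = {"User-Agent": "constraint-report-example/0.1"}

# 1) Query violations via SPARQL once they are loaded into WDQS.
SPARQL = """
SELECT ?statement ?constraint WHERE {
  ?statement wikibase:hasViolationForConstraint ?constraint .
}
LIMIT 10
"""
r = requests.get(WDQS_ENDPOINT,
                 params={"query": SPARQL, "format": "json"},
                 headers=HEADERS)
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    print(row["statement"]["value"], row["constraint"]["value"])

# 2) Fetch the per-entity constraint report through the MediaWiki API.
r = requests.get(WIKIDATA_API,
                 params={"action": "wbcheckconstraints", "id": "Q42",
                         "format": "json"},
                 headers=HEADERS)
r.raise_for_status()
print(list(r.json().get("wbcheckconstraints", {}).keys()))
```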

> Numbers regarding the total number of entities and the size of the values will be important of course, but perhaps most important will be some idea of access patterns. How frequently will entities be (over)written? How often read? I realize the answer to this is probably a distribution, and that it may involve some educated guesswork.

Some numbers I can put here right now are:

  • Wikidata has 50 million entities, each of which can have a violations report. Assuming we keep all of the violations for a single entity in a single blob, which is the current plan, that would mean 50 million blobs to store.
    • If we instead stored constraint data per statement, as @daniel suggested above, each of the current 544 million statements would have its own blob. Each blob would of course be smaller than a whole-entity blob, but the number of entries would be much larger.
  • There are currently around 20,000 edits per hour on wikidata.org, so the number of writes to the storage should be at most that. The plan is to run these constraint checks via jobs (T204031), in which case there will likely be some deduplication when an entity is edited multiple times in a short period. (See the rough arithmetic sketch after this list.)
  • Reads: I'll have to take a more careful look.
  • Size: I'll have to look at this too.
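
A rough back-of-envelope calculation from the figures above; the average blob sizes are placeholder assumptions until we have measured real constraint reports.

```lang=python
# Back-of-envelope numbers from the figures above. The average blob
# sizes are placeholder assumptions, not measurements.
ENTITIES = 50_000_000          # one report blob per entity (current plan)
STATEMENTS = 544_000_000       # one blob per statement (alternative)
EDITS_PER_HOUR = 20_000        # upper bound on writes, before job dedup

print(f"~{EDITS_PER_HOUR / 3600:.1f} writes/s at most")  # ~5.6 writes/s

for label, rows, avg_kib in [("per-entity blobs", ENTITIES, 4),
                             ("per-statement blobs", STATEMENTS, 1)]:
    total_gib = rows * avg_kib / (1024 * 1024)
    print(f"{label}: {rows:,} rows, ~{total_gib:,.0f} GiB at {avg_kib} KiB/blob")
```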

> What happens if constraint definitions change? Are we able to drop the older ones wholesale? Is the constraint check inlined on a miss, and is the latency (and additional load) under such circumstances acceptable? Or will some sort of transition be needed where we fall back to the older check while it's available, and replace them gradually?

As mentioned above, the constraint reports will be created post-edit. Because the plan is to use the job queue, in theory, if we purged all of the data it would slowly rebuild itself. However, the constraint checks can take some time, hence the request for storage more persistent than what we currently get with memcached.
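
Purely as an illustration of that post-edit flow, here is a sketch of a job that recomputes an entity's report and writes it to a persistent key/value store. The names (`ReportStore`, `run_constraint_checks`, `check_constraints_job`) are hypothetical and not the actual WikibaseQualityConstraints code.

```lang=python
# Hypothetical sketch of the post-edit job described above; the names
# and structure are illustrative, not the real extension code.
from __future__ import annotations


def run_constraint_checks(entity_id: str) -> bytes:
    """Placeholder for the real (slow) constraint-check run."""
    return b"{}"


class ReportStore:
    """Stand-in for a persistent k/v store keyed by entity ID."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, entity_id: str, report_blob: bytes) -> None:
        self._blobs[entity_id] = report_blob

    def get(self, entity_id: str) -> bytes | None:
        return self._blobs.get(entity_id)


def check_constraints_job(entity_id: str, store: ReportStore) -> None:
    """Job-queue worker: recompute the report for one entity after an edit.

    If the same entity is edited several times in a short window, the job
    queue can deduplicate so the expensive checks run only once. If the
    stored data were purged, re-running this job for every entity would
    slowly rebuild it.
    """
    store.put(entity_id, run_constraint_checks(entity_id))
```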

> I'll probably have more questions.

I look forward to them.


TASK DETAIL
https://phabricator.wikimedia.org/T204024
