Hi, if you need some "Wikibase item diff" function, have a look at the Rust crate I am co-authoring: https://gitlab.com/tobias47n9e/wikibase_rs

It comes with diff code: https://gitlab.com/tobias47n9e/wikibase_rs/-/blob/master/src/entity_diff.rs

It should not be too hard to build, e.g., a simple diff command-line tool from that.
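To make the idea concrete, here is a minimal sketch of what such a tool might look like. It deliberately does not use the crate's API (see entity_diff.rs for the real thing) and just compares two entity JSON files at the property level with serde_json; it assumes the plain entity JSON layout from the Wikidata dumps, and a real diff would of course go deeper, down to individual statements:

```rust
// entity_diff_cli.rs — illustrative sketch, not part of wikibase_rs.
// Reads two Wikibase entity JSON files (dump-style layout, "claims" at the
// top level) and prints which properties gained or lost statements.
// Cargo dependency assumed: serde_json = "1"

use std::collections::BTreeSet;
use std::env;
use std::fs;

// Collect the property IDs (P31, P569, ...) that have claims on the entity.
fn claim_keys(entity: &serde_json::Value) -> BTreeSet<String> {
    entity["claims"]
        .as_object()
        .map(|claims| claims.keys().cloned().collect())
        .unwrap_or_default()
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let args: Vec<String> = env::args().collect();
    if args.len() != 3 {
        eprintln!("usage: entity_diff_cli <old.json> <new.json>");
        std::process::exit(1);
    }
    let old: serde_json::Value = serde_json::from_str(&fs::read_to_string(&args[1])?)?;
    let new: serde_json::Value = serde_json::from_str(&fs::read_to_string(&args[2])?)?;

    let (old_props, new_props) = (claim_keys(&old), claim_keys(&new));
    for p in new_props.difference(&old_props) {
        println!("+ {}", p); // property added in the new revision
    }
    for p in old_props.difference(&new_props) {
        println!("- {}", p); // property removed in the new revision
    }
    Ok(())
}
```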
Cheers,
Magnus

On Fri, Feb 7, 2020 at 1:33 PM Guillaume Lederrey <gleder...@wikimedia.org> wrote:
> Hello all!
>
> First of all, my apologies for the long silence. We need to do better in terms of communication. I'll try my best to send a monthly update from now on. Keep me honest, remind me if I fail.
>
> First, we had a security incident at the end of December, which forced us to move from our Kafka-based update stream back to the RecentChanges poller. The details are still private, but you will be able to get the full story soon on Phabricator [1]. The RecentChanges poller is less efficient, and this is leading to high update lag again (just when we thought we had things slightly under control). We tried to mitigate this by improving the parallelism in the updater [2], which helped a bit, but not as much as we need.
>
> Another attempt to get update lag under control is to apply back pressure on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is obviously less than ideal (at least as long as WDQS updates are lagging as often as they are), but it does allow the service to recover from time to time. We probably need to iterate on this, provide better granularity, and differentiate better between operations that have an impact on update lag and those which don't.
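A side note on that maxlag mechanism, since it matters for tool authors: a client cooperates with this back pressure by sending the standard maxlag parameter with its write requests and backing off when the API responds with a maxlag error. A minimal sketch of such a loop, assuming reqwest and serde_json as dependencies; the actual wbeditentity parameters (id, data, token) are elided, and 5 seconds is just the conventional threshold:

```rust
// maxlag_backoff.rs — sketch of a client that respects maxlag.
// Cargo dependencies assumed:
//   reqwest = { version = "0.11", features = ["blocking", "json"] }
//   serde_json = "1"

use std::{thread, time::Duration};

fn post_edit(client: &reqwest::blocking::Client) -> Result<serde_json::Value, reqwest::Error> {
    // Every write request carries maxlag=5: the API refuses the request with
    // a "maxlag" error while replication (or, per [6], WDQS update) lag
    // exceeds 5 seconds.
    client
        .post("https://www.wikidata.org/w/api.php")
        .form(&[
            ("action", "wbeditentity"),
            ("format", "json"),
            ("maxlag", "5"),
            // ... id, data, and an edit token would go here in a real client
        ])
        .send()?
        .json()
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    loop {
        let resp = post_edit(&client)?;
        match resp["error"]["code"].as_str() {
            Some("maxlag") => {
                // Back off and retry; the response says how far behind we are.
                eprintln!("lagged by {:?}s, sleeping", resp["error"]["lag"]);
                thread::sleep(Duration::from_secs(10));
            }
            _ => break, // success (or a different error a real client would handle)
        }
    }
    Ok(())
}
```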
> On the slightly better news side, we now have a much better understanding of the update process and of its shortcomings. The current process does a full diff between each updated entity and what we have in Blazegraph. Even if a single triple needs to change, we still read tons of data from Blazegraph. While this approach is simple and robust, it is obviously not efficient. We need to rewrite the updater to take a more event streaming / reactive approach, and only work on the actual changes. This is a big chunk of work, almost a complete rewrite of the updater, and we need a new solution to stream changes with guaranteed ordering (something that our Kafka queues don't offer). This is where we are focusing our energy at the moment; this looks like the best option to improve the situation in the medium term. This change will probably have some functional impacts [3].
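To illustrate what that rewrite buys: a full-diff update has to read every triple of the entity back out of the store before it can compute the change, while a delta-based update applies only the change itself. A sketch with made-up types (Triple, Store, and both function names are hypothetical, not the actual updater code):

```rust
// update_strategies.rs — illustrative sketch only; Triple, Store, and the
// update functions are hypothetical, not the real WDQS updater API.

use std::collections::HashSet;

type Triple = (String, String, String); // (subject, predicate, object)

// Toy in-memory stand-in for the triple store (Blazegraph, in WDQS's case).
#[derive(Default)]
struct Store {
    triples: HashSet<Triple>,
}

impl Store {
    // Full read of an entity: this is the expensive part described above.
    fn triples_for_entity(&self, entity: &str) -> HashSet<Triple> {
        self.triples.iter().filter(|t| t.0 == entity).cloned().collect()
    }
}

/// Current approach: read *all* triples for the entity out of the store,
/// diff against the freshly fetched RDF, and apply the difference. Cost is
/// proportional to the entity's size, even for a one-triple edit.
fn full_diff_update(store: &mut Store, entity: &str, new_state: &HashSet<Triple>) {
    let old_state = store.triples_for_entity(entity);
    for t in old_state.difference(new_state) {
        store.triples.remove(t);
    }
    for t in new_state.difference(&old_state) {
        store.triples.insert(t.clone());
    }
}

/// Streaming approach: the change event already carries the delta, so the
/// store is never read. Cost is proportional to the *change*, but it only
/// works if events arrive exactly once and in guaranteed order.
fn delta_update(store: &mut Store, added: &[Triple], removed: &[Triple]) {
    for t in removed {
        store.triples.remove(t);
    }
    for t in added {
        store.triples.insert(t.clone());
    }
}

fn main() {
    let mut store = Store::default();
    let label = |s: &str| ("Q42".into(), "rdfs:label".into(), s.into());
    store.triples.insert(label("Douglas Adams"));

    // The same one-triple change, two very different costs:
    full_diff_update(&mut store, "Q42", &[label("Douglas Noel Adams")].into());
    // ... and a delta reverting it, this time without reading the store.
    delta_update(&mut store, &[label("Douglas Adams")], &[label("Douglas Noel Adams")]);
}
```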
> Some misc things:
>
> We have done some work to get better metrics and a better understanding of what's going on, from collecting more metrics during the update [4] to loading RDF dumps into Hadoop for further analysis [5] and better logging of SPARQL requests. We are not focusing on this analysis until we are in a more stable situation regarding update lag.
>
> We have a new team member working on WDQS. He is still ramping up, but we should have a bit more capacity from now on.
>
> Some longer term thoughts:
>
> Keeping all of Wikidata in a single graph is most probably not going to work long term. We have not found examples of public SPARQL endpoints with > 10 B triples, and there is probably a good reason for that. We will probably need to split the graphs at some point. We don't know how yet (that's why we loaded the dumps into Hadoop; that might give us some more insight). We might expose a subgraph with only truthy statements. Or have language-specific graphs, with only language-specific labels. Or something completely different.
>
> Keeping WDQS / Wikidata as open as they are at the moment might not be possible in the long term. We need to think about if / how we want to implement some form of authentication and quotas, potentially increasing quotas for some use cases but keeping them strict for others. Again, we don't know what this will look like, but we're thinking about it.
>
> What you can do to help:
>
> Again, we're not sure. Of course, reducing the load (both in terms of edits on Wikidata and of reads on WDQS) will help. But not using those services makes them useless.
>
> We suspect that some use cases are more expensive than others (a single property change to a large entity will require a comparatively insane amount of work to update it on the WDQS side). We'd like to have real data on the cost of various operations, but we only have guesses at this point.
>
> If you've read this far, thanks a lot for your engagement!
>
> Have fun!
>
> Guillaume
>
> [1] https://phabricator.wikimedia.org/T241410
> [2] https://phabricator.wikimedia.org/T238045
> [3] https://phabricator.wikimedia.org/T244341
> [4] https://phabricator.wikimedia.org/T239908
> [5] https://phabricator.wikimedia.org/T241125
> [6] https://phabricator.wikimedia.org/T221774
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+1 / CET
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata