Hi,

If you need some "Wikibase item diff" function, have a look at the Rust
crate I am co-authoring:
https://gitlab.com/tobias47n9e/wikibase_rs

It comes with diff code:
https://gitlab.com/tobias47n9e/wikibase_rs/-/blob/master/src/entity_diff.rs

Should not be too hard to build e.g. a simple diff command-line tool from
that.
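
Just to give an idea, here is a rough, hypothetical sketch of such a tool.
It deliberately does not use the crate's own entity_diff types (see
entity_diff.rs above for the real API); it only fetches two revisions of an
entity as JSON via Special:EntityData and reports which properties gained or
lost statements. Cargo dependencies assumed: reqwest (with the blocking and
json features) and serde_json.

// Hypothetical sketch of a tiny "entity diff" CLI.
// Usage: entity-diff Q42 <old_revid> <new_revid>
use std::collections::BTreeSet;
use std::env;

// Collect the property IDs (P31, P19, ...) that have at least one statement.
fn claim_keys(entity: &serde_json::Value) -> BTreeSet<String> {
    entity["claims"]
        .as_object()
        .map(|m| m.keys().cloned().collect())
        .unwrap_or_default()
}

// Fetch one revision of an entity as JSON from Special:EntityData.
fn fetch(qid: &str, rev: &str) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
    let url = format!(
        "https://www.wikidata.org/wiki/Special:EntityData/{}.json?revision={}",
        qid, rev
    );
    let json: serde_json::Value = reqwest::blocking::get(&url)?.json()?;
    Ok(json["entities"][qid].clone())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let args: Vec<String> = env::args().collect();
    let (qid, old_rev, new_rev) = (&args[1], &args[2], &args[3]);

    let old = claim_keys(&fetch(qid, old_rev)?);
    let new = claim_keys(&fetch(qid, new_rev)?);

    // Properties that only appear in one of the two revisions.
    for p in new.difference(&old) {
        println!("+ {}", p);
    }
    for p in old.difference(&new) {
        println!("- {}", p);
    }
    Ok(())
}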

Cheers,
Magnus

On Fri, Feb 7, 2020 at 1:33 PM Guillaume Lederrey <gleder...@wikimedia.org>
wrote:

> Hello all!
>
> First of all, my apologies for the long silence. We need to do better in
> terms of communication. I'll try my best to send a monthly update from now
> on. Keep me honest, remind me if I fail.
>
> First, we had a security incident at the end of December, which forced us
> to move from our Kafka-based update stream back to the RecentChanges
> poller. The details are still private, but you will be able to get the full
> story soon on Phabricator [1]. The RecentChanges poller is less efficient,
> and this is leading to high update lag again (just when we thought we had
> things slightly under control). We tried to mitigate this by improving the
> parallelism in the updater [2], which helped a bit, but not as much as we
> need.
>
> Another attempt to get update lag under control is to apply back pressure
> on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is
> obviously less than ideal (at least as long as WDQS updates are lagging as
> often as they are), but does allow the service to recover from time to
> time. We probably need to iterate on this, providing better granularity
> and differentiating better between operations that have an impact on update
> lag and those that don't.
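>
> For bot and tool operators, the usual way to cooperate with this back
> pressure is to send a maxlag parameter with every API request and to back
> off whenever the server reports a maxlag error. Below is a rough,
> hypothetical sketch of such a client (not code we ship), assuming the
> reqwest (blocking + json features) and serde_json crates:
>
> use std::{thread, time::Duration};
>
> // Hypothetical example: a client that respects maxlag. Every request
> // carries the commonly recommended maxlag=5; if the API answers with
> // error code "maxlag" (which, with this change, also reflects WDQS
> // update lag), we back off and retry instead of piling on more load.
> fn api_get(params: &[(&str, &str)]) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
>     let client = reqwest::blocking::Client::new();
>     loop {
>         let resp: serde_json::Value = client
>             .get("https://www.wikidata.org/w/api.php")
>             .query(&[("format", "json"), ("maxlag", "5")])
>             .query(params)
>             .send()?
>             .json()?;
>         if resp["error"]["code"] == "maxlag" {
>             // Lag (DB replication or WDQS updates) is too high: wait and retry.
>             thread::sleep(Duration::from_secs(30));
>             continue;
>         }
>         return Ok(resp);
>     }
> }
>
> fn main() -> Result<(), Box<dyn std::error::Error>> {
>     let siteinfo = api_get(&[("action", "query"), ("meta", "siteinfo")])?;
>     println!("{}", siteinfo["query"]["general"]["sitename"]);
>     Ok(())
> }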
>
> On the slightly better news side, we now have a much better understanding
> of the update process and of its shortcomings. The current process does a
> full diff between each updated entity and what we have in Blazegraph. Even
> if a single triple needs to change, we still read tons of data from
> Blazegraph. While this approach is simple and robust, it is obviously not
> efficient. We need to rewrite the updater to take a more event streaming /
> reactive approach, and only work on the actual changes. This is a big chunk
> of work, almost a complete rewrite of the updater, and we need a new
> solution to stream changes with guaranteed ordering (something that our
> Kafka queues don't offer). This is where we are focusing our energy at the
> moment, as it looks like the best option to improve the situation in the
> medium term. This change will probably have some functional impacts [3].
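>
> To make the difference concrete, here is a purely illustrative sketch (in
> Rust for brevity; it is not the actual updater code): if you treat an
> entity's triples as sets, the diff itself is cheap to compute, but the
> expensive part is materializing the full "stored" set from Blazegraph on
> every edit. A stream-based updater would instead receive only the changed
> triples as events and skip that read entirely.
>
> use std::collections::HashSet;
>
> type Triple = (String, String, String); // (subject, predicate, object)
>
> // Minimal delete/insert sets needed to turn `stored` into `target`.
> // The set operations are trivial; the cost is reading `stored` back
> // from the triple store for every edit, however small.
> fn diff(stored: &HashSet<Triple>, target: &HashSet<Triple>) -> (Vec<Triple>, Vec<Triple>) {
>     let to_delete = stored.difference(target).cloned().collect();
>     let to_insert = target.difference(stored).cloned().collect();
>     (to_delete, to_insert)
> }
>
> fn main() {
>     let stored: HashSet<Triple> = [
>         ("wd:Q42".into(), "wdt:P31".into(), "wd:Q5".into()),
>         ("wd:Q42".into(), "rdfs:label".into(), "\"Douglas Adams\"@en".into()),
>     ]
>     .into_iter()
>     .collect();
>
>     // One statement is added; everything else is unchanged.
>     let mut target = stored.clone();
>     target.insert(("wd:Q42".into(), "wdt:P19".into(), "wd:Q350".into()));
>
>     let (del, ins) = diff(&stored, &target);
>     println!("delete {:?}, insert {:?}", del, ins);
>     // An event-streaming updater would receive just the inserted triple
>     // and never have to read the other triples back at all.
> }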
>
> Some misc things:
>
> We have done some work to get better metrics and better understanding of
> what's going on. From collecting more metrics during the update [4] to
> loading RDF dumps into Hadoop for further analysis [5] and better logging
> of SPARQL requests. We are not focusing on this analysis until we are in a
> more stable situation regarding update lag.
>
> We have a new team member working on WDQS. He is still ramping up, but we
> should have a bit more capacity from now on.
>
> Some longer term thoughts:
>
> Keeping all of Wikidata in a single graph is most probably not going to
> work long term. We have not found examples of public SPARQL endpoints with
> more than 10 billion triples, and there is probably a good reason for that.
> We will
> probably need to split the graphs at some point. We don't know how yet
> (that's why we loaded the dumps into Hadoop, that might give us some more
> insight). We might expose a subgraph with only truthy statements. Or have
> language specific graphs, with only language specific labels. Or something
> completely different.
>
> Keeping WDQS / Wikidata as open as they are at the moment might not be
> possible in the long term. We need to think if / how we want to implement
> some form of authentication and quotas. Potentially increasing quotas for
> some use cases, but keeping them strict for others. Again, we don't know
> what this will look like, but we're thinking about it.
>
> What you can do to help:
>
> Again, we're not sure. Of course, reducing the load (both in terms of
> edits on Wikidata and of reads on WDQS) will help. But not using those
> services makes them useless.
>
> We suspect that some use cases are more expensive than others (a single
> property change to a large entity will require a comparatively insane
> amount of work to update it on the WDQS side). We'd like to have real data
> on the cost of various operations, but we only have guesses at this point.
>
> If you've read this far, thanks a lot for your engagement!
>
>   Have fun!
>
>       Guillaume
>
>
>
>
> [1] https://phabricator.wikimedia.org/T241410
> [2] https://phabricator.wikimedia.org/T238045
> [3] https://phabricator.wikimedia.org/T244341
> [4] https://phabricator.wikimedia.org/T239908
> [5] https://phabricator.wikimedia.org/T241125
> [6] https://phabricator.wikimedia.org/T221774
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+1 / CET
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
