Hi all,

At around 1 PM UTC today (Sep 3), we started experiencing stability issues
with WDQS, so far limited to one of our two datacenters. Unfortunately, we
haven't been able to pinpoint the cause yet. We suspect that someone is
running a query that affects Blazegraph, which has happened a few times in
the past. Unfortunately, our usual tactics did not help us find which one.

We are working on identifying the issue, but it's clear that it could
bring the service down within a few hours, so we are also working on a
quick workaround. Since we observed that the issue only causes actual
service failures about 2 hours after a restart, for now we are going to
introduce a procedure that restarts servers at random, so that the uptime
of each stays at roughly 1 hour at most. Only one server should be
restarted at any given time. Each restart will kill the queries running on
that server, but the alternative is worse.
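
For illustration only, the restart scheme described above could be planned
roughly as in the sketch below. This is not the actual procedure; the
function name plan_restarts, its parameters, and the host names in the
usage note are hypothetical. It assumes a random restart order and an even
stagger, so that only one server restarts at a time as long as a single
restart takes less than the gap between events.

```python
import random


def plan_restarts(servers, max_uptime_s=3600, horizon_s=7200, seed=None):
    """Return a list of (offset_seconds, server) restart events.

    Hypothetical sketch: servers are shuffled into a random order and
    restarted one at a time, evenly staggered, so that each server is
    restarted once every max_uptime_s seconds.
    """
    if not servers:
        return []
    rng = random.Random(seed)
    order = list(servers)
    rng.shuffle(order)  # random order, fixed for the whole plan

    # Spread one full round of restarts evenly over max_uptime_s.
    gap = max_uptime_s / len(order)

    events = []
    t = 0.0
    i = 0
    while t < horizon_s:
        events.append((t, order[i % len(order)]))
        t += gap
        i += 1
    return events
```

With three placeholder hosts, plan_restarts(["host-a", "host-b", "host-c"])
yields one restart every 20 minutes, and each host is restarted once per
hour.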

We'll continue working to find the root cause and will keep you informed
of our progress. We will also post updates here: [1].

Regards,
Zbyszko Papierski

[1] https://phabricator.wikimedia.org/T290330

-- 

Zbyszko Papierski (He/Him)

Senior Software Engineer

Wikimedia Foundation <https://wikimediafoundation.org/>
_______________________________________________
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
