Hi Andy,
Thanks for the comments.
We'll investigate further and will update the thread if we get a resolution.
Some responses inline.
Dave
On 17/02/2021 10:29, Andy Seaborne wrote:
Not clear to me exactly what's happening.
Is there any characteristic like increased load around or soon before
the problem arises?
Not that we can see.
Are the read calls being made from the same machine as the updates?
No. Updates come from within the same node, but the read calls come
from one of a number of replicas of an API service, which are typically
running on different nodes.
Martynas's experience suggests a possibility, because client-side code
can eventually interfere with the server.
1/
A possibility is that an HTTP connection (client-side) is not being
handled correctly. There is an (HttpClient) connection pool. If a
connection is mishandled, then some time later the server may not be
able to write the 204 response for an update.
Agreed. Though it's not clear to me how that leads to read operations
(from other locations) apparently being blocked from completing.
It is not necessarily the operation immediately preceding the update
that is mishandling things. Because each end has connection pools, one
tainted connection in a pool can take a while to show up, or the pool
may become exhausted.
Good point.
The problem might be the server can't send the reply because the server
connection is still "in use". Due to pooling there are possibly several
connections between a single client and the server.
The RDFConnection calls "querySelect" and "queryResults" give the app
patterns where this does not happen.
If you use "query()" to get a QueryExecution, it still should be consumed
or closed in a try-with-resources block.
This is a possibility. It does not directly explain what you've seen,
but the effects of this kind of problem are hard to characterise; it's
difficult to say exactly how they will appear.
We're generally careful with things like try-with-resources blocks, but
it's definitely worth checking again.
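For reference, the two patterns discussed above look roughly like this. A minimal sketch: an in-memory dataset stands in for the remote Fuseki endpoint (which would be RDFConnectionFactory.connect("http://localhost:3030/ds") in production), and the queries are placeholders.

```java
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.rdfconnection.RDFConnectionFactory;

public class QueryPatterns {
    public static void main(String[] args) {
        // In-memory dataset stands in for the remote Fuseki endpoint.
        try (RDFConnection conn =
                 RDFConnectionFactory.connect(DatasetFactory.createTxnMem())) {

            // Pattern 1: querySelect manages the QueryExecution for you;
            // the underlying connection is always released when the
            // callback finishes.
            conn.querySelect("SELECT * { ?s ?p ?o }",
                             row -> System.out.println(row));

            // Pattern 2: query() hands back a QueryExecution which must
            // be consumed or closed (try-with-resources), otherwise the
            // HTTP connection may never be returned to the pool.
            try (QueryExecution qExec = conn.query("ASK { ?s ?p ?o }")) {
                boolean any = qExec.execAsk();
                System.out.println("any triples: " + any);
            }
        }
    }
}
```

The key point is that both patterns guarantee the execution is closed even if the result handling throws.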
2/
It's made more complicated in TDB1 because read requests may be stopping
the finalization (i.e. clearing out of the commit log) of updates. That
again is something where the root cause is not at the point of failure
but at some point earlier.
This happens if there is a period of high load and may be made worse by
case 1 like effects.
Yes, one of the things we're testing is shifting to TDB2 to see if that
changes the behaviour. Since we're not seeing any clear pattern of
increased load it feels like a long shot but would at least eliminate
that factor and may shed further light.
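In case it helps anyone following along, the switch is mostly an assembler-config change; a minimal sketch, assuming the standard Fuseki assembler vocabulary (the service name and database location are placeholders):

```turtle
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX tdb2:   <http://jena.apache.org/2016/tdb#>

<#service> rdf:type fuseki:Service ;
    fuseki:name          "ds" ;
    fuseki:serviceQuery  "query" ;
    fuseki:serviceUpdate "update" ;
    fuseki:dataset       <#dataset> .

# TDB2 dataset in place of the TDB1 one; note TDB2 uses its own
# on-disk format, so the data must be reloaded, not copied.
<#dataset> rdf:type tdb2:DatasetTDB2 ;
    tdb2:location "/fuseki/databases/ds2" .
```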
Thanks again.
Dave
Andy
On 17/02/2021 08:04, Dave Reynolds wrote:
Thanks Martynas. I don't think it quite fits the symptoms, and it's
something we've tried to be careful of, but worth another look.
Dave
On 16/02/2021 12:55, Martynas Jusevičius wrote:
Not Fuseki-related per se, but I've experienced something similar when
the HTTP client is running out of connections.
On Tue, Feb 16, 2021 at 11:50 AM Dave Reynolds
<dave.e.reyno...@gmail.com> wrote:
We have a mysterious problem with fuseki in production that we've not
seen before. Posting in case anyone has seen something similar and has
any advice but I realise there's not really much here to go on.
Environment:
Fuseki 3.17 (was 3.16, tried upgrade just in case) using TDB1
OpenJDK java 8
Docker container (running in k8s pod)
AWS EBS file system
O(2k) small updates per day (uses RDFConnection to send update)
Variable read request rate but issue hits at low request levels
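Schematically, the update path is shaped like this. A sketch only: the update body is a placeholder, and an in-memory dataset keeps it self-contained where the real code connects to the Fuseki endpoint.

```java
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.rdfconnection.RDFConnectionFactory;

public class UpdateSender {
    public static void main(String[] args) {
        // In production this would be
        // RDFConnectionFactory.connect("http://localhost:3030/ds");
        // an in-memory dataset keeps the sketch self-contained.
        try (RDFConnection conn =
                 RDFConnectionFactory.connect(DatasetFactory.createTxnMem())) {
            // One small update at a time; the caller serialises these so
            // at most one update is ever in flight.
            conn.update("INSERT DATA { <urn:example:s> <urn:example:p> \"o\" }");
        }
    }
}
```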
Symptoms are that fuseki receives an update request but never
completes it:
INFO 550175 POST http://localhost:3030/ds
INFO 550175 Update
INFO 550175 204 No Content (20 ms)
INFO 550176 POST http://localhost:3030/ds
INFO 550176 Update
-->
INFO 550178 Query = ASK { ?s ?p ?o }
INFO 550178 GET
http://localhost:3030/ds?query=ASK+%7B+%3Fs+%3Fp+%3Fo+%7D
INFO 550179 GET
http://localhost:3030/ds?query=ASK+%7B+%3Fs+%3Fp+%3Fo+%7D
INFO 550179 Query = ASK { ?s ?p ?o }
So no 204 return from request 550176.
From that point on fuseki continues to log incoming read queries but
does not answer any of them and the update request never terminates.
Acts as if there's some form of deadlock.
Update requests are serialised; there's never more than one in flight
at a time.
It's not the update itself that's the issue. It's small and if the
container is restarted with the same data and the same update sequence
is reapplied it all works fine.
The jvm stats all look completely fine in the prometheus records.
The various parts of this setup have been used in various production
settings without problems in the past. In particular, we've run the
exact same pattern of mixed updates and queries in fuseki in a k8s
environment for two years without ever having a lockup. But on a new
deployment it's happening every few days.
There are differences between the new and old deployments but the ones
we've identified seem very unlikely to be the cause. We've not used
RDFConnection in the client before but can't see how that could affect
this. We don't often run with TDB on EBS but we do have a dozen
instances of that around which haven't had problems. We have generally
shifted to AWS Corretto as the jvm but we have plenty of OpenJDK
instances around without problems. The docker image is slightly unusual
in using the s6 overlay init system rather than running fuseki as the
root process, but again we can't see how this might cause these
symptoms, and other uses of that image, with fuseki, have been fine.
We'll find a workaround eventually, possibly involving shifting to TDB2,
but posting in case anyone has had an experience similar enough to this
to give us some hints.
Dave