Hi all,

We've recently rebuilt our two main virtual servers running Fuseki, which serve as the backend databases of the Finto.fi vocabulary service. After running the new servers for a few weeks, we've already seen two cases, one on each server, where some of the SPARQL queries start failing.

In our setup, Fuseki is managing a relatively large TDB2 database along with a jena-text index. We keep the data in around 50 named graphs (one per vocabulary) and each graph is typically updated using s-put, replacing the whole graph in-place. When all data is initially loaded, the database directory takes around 25GB on one server and 44GB on the other. TDB2 tends to keep growing over time though, so around once a month we delete the whole database and rebuild it from RDF source files that we keep under version control; the master data doesn't reside within Fuseki.
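To make the update pattern concrete, here is a minimal Python sketch of what a whole-graph replacement looks like at the HTTP level: a SPARQL Graph Store Protocol PUT, which is what s-put issues. The endpoint URL, graph URI and helper name are hypothetical placeholders, not our actual configuration, and the sketch only builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_graph_put(gsp_url, graph_uri, turtle_bytes):
    """Build (but do not send) a Graph Store Protocol PUT request.

    A PUT replaces the entire content of the named graph, which is the
    update pattern described above. gsp_url is assumed to be the
    dataset's Graph Store Protocol endpoint.
    """
    query = urlencode({"graph": graph_uri})
    return Request(
        f"{gsp_url}?{query}",
        data=turtle_bytes,
        method="PUT",
        headers={"Content-Type": "text/turtle"},
    )


# Hypothetical endpoint and graph URI, for illustration only.
req = build_graph_put(
    "http://localhost:3030/ds/data",
    "http://example.org/vocab/",
    b"<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n",
)
```

Passing the request to urllib.request.urlopen(req) would perform the actual replacement.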

When queries start failing, the Fuseki logs show long tracebacks, but the core of the problem seems to be these two exceptions:

org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read
Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type 0

I've put the full tracebacks into a gist [1]: there is one traceback for a failed SELECT query and another for a failed update query.

In both cases, the database size had grown to over 100GB when this happened. A restart of Fuseki didn't help, but rebuilding the whole database made the problem go away, at least for now. But this is making me worried that it could happen again at any time. I'm sorry that I don't have an easy way of reproducing the problem at this time, as it only seems to happen after Fuseki has been running for a few weeks and has performed many different update operations on the TDB2 dataset.
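Since both failures appeared after the database had grown past 100GB, one thing that could be watched is the size of the database directory. The following is a minimal Python sketch under that assumption. The 100GB limit is only the size observed in these two incidents, not a known threshold in TDB2, and the function names are made up for illustration.

```python
import os


def dir_size_bytes(path):
    """Sum the apparent sizes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def exceeds_limit(path, limit_bytes=100 * 10**9):
    """Return True once the directory has reached limit_bytes.

    The default of 100 GB is an assumption taken from the two
    incidents described above.
    """
    return dir_size_bytes(path) >= limit_bytes
```

Run from cron against the TDB2 database directory, a check like this could trigger a rebuild before the database reaches the size at which the errors were seen.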

I searched online for similar issues and exceptions. I could only find one gist [2] showing a very similar traceback, which apparently happened when running a compact operation on Fuseki 4.3.1. There was also a similar issue [3] reported in Apache Impala, and there the problem was fixed by adding more careful checks into the code writing Thrift node metadata. But that code is written in C++, so it's a bit hard to compare that to the Jena codebase.

The Impala issue says: "Since IMPALA-1048 we write TRuntimeProfileNode.node_metadata unconditionally, even when both its fields are unset. This trips up the Thrift library Java reader code, which expects to find exactly one type of a union to be set." Is it possible that Jena is similarly careless when writing Thrift metadata?

This happened with Fuseki version 4.6.1, since we did the install just before the 4.7.0 release. I've just upgraded one of the machines to 4.7.0 to see if it makes a difference. I can see that libthrift was updated from 0.16.0 to 0.17.0 in PR #1570, which happened in between the two Jena releases. It's possible that the problem has already been fixed there. In that case, I'm really sorry for the noise.

Is there anything I could do to help debug the problem? For now I will just keep monitoring the Fuseki instances to see if this happens again, especially with the new version.
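As a starting point for the monitoring, here is a minimal Python sketch that counts occurrences of the two exception signatures quoted above in a sequence of log lines. The signature strings are taken from the tracebacks. The function name and the log path in the usage note are hypothetical.

```python
import re

# Patterns taken from the two exceptions quoted earlier in this message.
SIGNATURES = {
    "tdb_node_read": re.compile(r"TDBException: NodeTableTRDF/Read"),
    "thrift_unrecognized_type": re.compile(
        r"TProtocolException: Unrecognized type \d+"
    ),
}


def count_signatures(lines):
    """Count how many lines match each known exception signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

It could be used as `with open("fuseki.log") as f: print(count_signatures(f))`, where the file name is a placeholder for wherever the Fuseki log is written.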

Information about the setup:

OS: Rocky Linux 9.1 (RHEL based)
Kernel/arch: 5.14.0 x86_64
Java: openjdk version "11.0.18" 2023-01-17 LTS

Cheers,
Osma


[1] https://gist.github.com/osma/d61281160e84ea74e9d7dbc155ffaf69

[2] https://gist.github.com/jeffreycwitt/e7c270aae46f403845c87aa57e4b82af

[3] https://issues.apache.org/jira/browse/IMPALA-8252

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
