>> org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read
>> Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized
>> type 0

This is a symptom, not a cause. Reads don't cause node table updates (the failure in your update trace looks like it comes from the read part of the update steps in a CLEAR).

Previous reports of this have come down to hitting disk limits, other OS processes touching the files (including via a shared file system), and I/O errors: external environment factors that happen silently, a significant time before the problem emerges.
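Since several of those reports involved the disk filling up, a periodic check of free space on the volume holding the database is cheap to run. A minimal sketch in Python; the function names and the 20% threshold are illustrative assumptions, not Jena or Fuseki settings:

```python
import shutil

def free_fraction(path):
    """Return the fraction of the volume containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def disk_is_low(path, min_free=0.20):
    """True when free space on the volume holding `path` is below `min_free`.

    The 0.20 default is an arbitrary illustrative threshold.
    """
    return free_fraction(path) < min_free

# Example (path is hypothetical): run from cron and alert when it returns True.
#   disk_is_low("/path/to/fuseki/databases")
```

This only catches one of the environmental causes listed above; it says nothing about other processes touching the files or about I/O errors.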

Unfortunately, reports don't always get completed - there's a report, they try some things out, and we don't hear anything more. We don't get a picture of what actually happened or what worked.

The one Jena-related issue was compaction in the presence of updates.

Compact got significant robustness improvements at 4.6.x.

https://github.com/apache/jena/issues/1252
https://github.com/apache/jena/pull/1456

It should be safe to compact an online database. Note that a compact is a "write" operation, so concurrent writers are held up while the compact is running. Outstanding concurrent readers can continue, and new concurrent readers can start during compaction.
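For a database served by Fuseki, compaction can be triggered over HTTP through the admin interface. A minimal sketch in Python, assuming the admin endpoint is `POST /$/compact/{name}` with an optional `deleteOld=true` parameter (check the Fuseki documentation for your version); the server URL and dataset name are hypothetical, and the helper `compact_request` is made up for illustration:

```python
import urllib.parse
import urllib.request

def compact_request(server, dataset, delete_old=True):
    """Build (but do not send) a POST request asking Fuseki to compact `dataset`.

    Assumes the admin endpoint /$/compact/{name}; `deleteOld=true` asks the
    server to remove the previous generation of the database afterwards.
    """
    name = urllib.parse.quote(dataset.strip("/"), safe="")
    url = f"{server.rstrip('/')}/$/compact/{name}"
    if delete_old:
        url += "?deleteOld=true"
    return urllib.request.Request(url, data=b"", method="POST")

# To actually send it (server and dataset name are hypothetical):
#   urllib.request.urlopen(compact_request("http://localhost:3030", "ds"))
```

The admin endpoints may be restricted to localhost or require authentication, depending on how the server is configured.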

> The Impala issue says: "Since IMPALA-1048 we write
> TRuntimeProfileNode.node_metadata unconditionally, even when both its
> fields are unset. This trips up the Thrift library Java reader code,
> which expects to find exactly one type of a union to be set." Is it
> possible that Jena is similarly careless when writing Thrift metadata?

Anything is possible, but Jena's use of Thrift is Java-only, and Thrift enforces the assumption that exactly one field of a union is set.

The "type 0" means it is reading something corrupted at a lower level.
A union is used for all RDF terms. Unless you have node extensions (which need Java code), this is code that is executed a lot.

https://github.com/apache/jena/blob/16c9a8295d78a19787bdaa05b359af97ba00dcab/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift#L68
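To illustrate why "type 0" points at bad bytes rather than a bad writer: in Thrift, type 0 is the STOP marker, not a value type. The sketch below decodes a field header byte as laid out in Thrift's compact encoding (high four bits are the field-id delta, low four bits the type). It assumes the compact encoding is the one in use and is an illustration only, not Jena or libthrift code; `decode_field_header` is a made-up helper.

```python
# Type codes used in the low nibble of a compact-protocol field header.
COMPACT_TYPES = {
    0: "STOP", 1: "BOOLEAN_TRUE", 2: "BOOLEAN_FALSE", 3: "BYTE",
    4: "I16", 5: "I32", 6: "I64", 7: "DOUBLE",
    8: "BINARY", 9: "LIST", 10: "SET", 11: "MAP", 12: "STRUCT",
}

def decode_field_header(byte):
    """Split one header byte into (field-id delta, type name)."""
    delta = (byte >> 4) & 0x0F
    type_name = COMPACT_TYPES.get(byte & 0x0F, "UNKNOWN")
    return delta, type_name

# A zero byte, e.g. from a region of the file that was never written,
# decodes as STOP (type 0) where a union field was expected.
print(decode_field_header(0x00))
# A plausible header: field-id delta 1, type BINARY.
print(decode_field_header(0x18))
```

A reader that expects one set field of the RDF term union and finds STOP instead has nothing valid to read, which is consistent with the file contents being damaged underneath Jena.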

    Andy

On 29/03/2023 08:06, Osma Suominen wrote:
Hi all,

we've recently rebuilt our two main virtual servers running Fuseki which are the backend databases of the Finto.fi vocabulary service. After running the new servers for a few weeks we've already seen two cases, one on each server, where some of the SPARQL queries start failing.

In our setup, Fuseki is managing a relatively large TDB2 database along with a jena-text index. We keep the data in around 50 named graphs (one per vocabulary) and each graph is typically updated using s-put, replacing the whole graph in-place. When all data is initially loaded, the database directory takes around 25GB on one server and 44GB on the other. TDB2 tends to keep growing over time though, so around once a month we delete the whole database and rebuild it from RDF source files that we keep under version control; the master data doesn't reside within Fuseki.

When queries start failing, the Fuseki logs show long tracebacks, but the beef seems to be these two exceptions:

org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read
Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type 0

I've put the whole tracebacks into a gist [1]: there is one traceback for a failed SELECT query and another for a failed update query.

In both cases, the database size had grown to over 100GB when this happened. A restart of Fuseki didn't help, but rebuilding the whole database made the problem go away - for now. But this is making me worried that it could happen again any time. I'm sorry that I don't have an easy way of reproducing the problem at this time, as it only seems to happen after Fuseki has been running for a few weeks and done many different update operations on the TDB2 dataset.

I searched online for similar issues and exceptions. I could only find one gist [2] showing a very similar traceback, which apparently happened when running a compact operation on Fuseki 4.3.1. There was also a similar issue [3] reported in Apache Impala, and there the problem was fixed by adding more careful checks into the code writing Thrift node metadata. But that code is written in C++, so it's a bit hard to compare that to the Jena codebase.

The Impala issue says: "Since IMPALA-1048 we write TRuntimeProfileNode.node_metadata unconditionally, even when both its fields are unset. This trips up the Thrift library Java reader code, which expects to find exactly one type of a union to be set." Is it possible that Jena is similarly careless when writing Thrift metadata?

This happened with Fuseki version 4.6.1, since we did the install just before the 4.7.0 release. I've just upgraded one of the machines to 4.7.0 to see if it makes a difference. I can see that libthrift was updated from 0.16.0 to 0.17.0 in PR #1570, which happened in between the two Jena releases. It's possible that the problem has already been fixed there. In that case, I'm really sorry for the noise.

Is there anything I could do to help debug the problem? For now I will just keep monitoring the Fuseki instances to see if this happens again, especially with the new version.

Information about the setup:

OS: Rocky Linux 9.1 (RHEL based)
Kernel/arch: 5.14.0 x86_64
Java: openjdk version "11.0.18" 2023-01-17 LTS

Cheers,
Osma


[1] https://gist.github.com/osma/d61281160e84ea74e9d7dbc155ffaf69

[2] https://gist.github.com/jeffreycwitt/e7c270aae46f403845c87aa57e4b82af

[3] https://issues.apache.org/jira/browse/IMPALA-8252
