Re: Thrift problem / corruption on large TDB2 Fuseki dataset

Andy Seaborne Wed, 29 Mar 2023 02:20:29 -0700

>> org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read
>> Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized
>> type 0

This is a symptom, not a cause. Reads don't cause node table updates (the failure in your update looks like it comes from the read part of the update steps in a CLEAR).

Previous reports of this have involved hitting disk limits, other OS processes touching the files (including via a shared file system), and I/O errors: external environment factors that happen silently a significant time before the problem emerges.
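
Since disk limits are one of the environmental factors mentioned above, a simple headroom check can help catch them early. The following is a minimal Python sketch, not part of Jena: the function names and the 20% threshold are hypothetical choices for illustration, and the database directory is whatever path your Fuseki configuration uses.

```python
import os
import shutil


def directory_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def disk_headroom(db_dir, min_free_fraction=0.2):
    """Report database size and free space on the volume holding db_dir.

    min_free_fraction is an arbitrary illustrative threshold, not a
    Jena recommendation.
    """
    usage = shutil.disk_usage(db_dir)
    free_fraction = usage.free / usage.total
    return {
        "db_bytes": directory_size_bytes(db_dir),
        "free_bytes": usage.free,
        "free_fraction": free_fraction,
        "ok": free_fraction >= min_free_fraction,
    }
```

Something like this could run periodically against the TDB2 database directory and raise an alert when `ok` is false. Keep in mind that compaction writes a new copy of the live data before the old one can be removed, so it temporarily needs extra space.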

Unfortunately, reports don't always get completed: there's a report, they try some things out, and we don't hear anything more. We don't get a picture of what actually happened or what worked.


The one Jena-related issue was compaction in the presence of updates.

Compact got significant robustness improvements at 4.6.x.

https://github.com/apache/jena/issues/1252
https://github.com/apache/jena/pull/1456

It should be safe to compact an online database. Note that a compact is a "write" operation, so while the compact is running, concurrent writers are held up. Outstanding concurrent readers can continue, and new concurrent readers can start during compaction.
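
For a database served by Fuseki, compaction can be requested over HTTP through the admin interface. Below is a hedged Python sketch of building that request. The `/$/compact/{name}` path and the `deleteOld=true` parameter are taken from the Fuseki/TDB2 administration documentation and should be verified against your version; `build_compact_request` and the server URL are hypothetical names used only for illustration.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_compact_request(server_url, dataset_name, delete_old=False):
    """Build (but do not send) a POST to the Fuseki compact admin endpoint.

    server_url and dataset_name are placeholders for your own setup.
    """
    base = server_url.rstrip("/")
    name = quote(dataset_name.strip("/"), safe="")
    url = f"{base}/$/compact/{name}"
    if delete_old:
        # Ask the server to remove the previous database generation afterwards.
        url += "?" + urlencode({"deleteOld": "true"})
    return Request(url, data=b"", method="POST")
```

Passing the result to `urllib.request.urlopen` would send it; that needs a running server and whatever access the admin endpoints require in your deployment. Because compaction is a write operation, it is best scheduled for a time when holding up other writers is acceptable.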


> The Impala issue says: "Since IMPALA-1048 we write
> TRuntimeProfileNode.node_metadata unconditionally, even when both its
> fields are unset. This trips up the Thrift library Java reader code,
> which expects to find exactly one type of a union to be set." Is it
> possible that Jena is similarly careless when writing Thrift metadata?

Anything is possible, but Jena's use of Thrift is Java-only and Thrift enforces the union-defined assumption.


The "type 0" means it is reading something that was corrupted at a lower level.

Union is used for all RDF terms. Unless you have node extensions (which need Java code), this is code that is executed a lot.


https://github.com/apache/jena/blob/16c9a8295d78a19787bdaa05b359af97ba00dcab/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift#L68
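
To illustrate why bad bytes can surface as "type 0": in Thrift, type code 0 is STOP, the marker for "no more fields". A union reader expects exactly one field, so finding STOP where a field header should be is an error. The toy model below is a sketch in Python, not Jena or Thrift code. It assumes a compact-protocol-style header byte (low 4 bits type, high 4 bits field-id delta), and `ProtocolError` and `read_union_field_header` are hypothetical names. It only mimics the message; the real exception is raised inside the Thrift Java library.

```python
THRIFT_STOP = 0  # Thrift's TType.STOP: marks "no more fields"


class ProtocolError(Exception):
    """Stand-in for Thrift's TProtocolException in this toy model."""


def read_union_field_header(data, offset):
    """Decode one field header byte of a union in this toy model.

    Low 4 bits: wire type. High 4 bits: field-id delta.
    A union must carry one field, so a STOP type here is rejected.
    """
    header = data[offset]
    wire_type = header & 0x0F
    id_delta = header >> 4
    if wire_type == THRIFT_STOP:
        raise ProtocolError(f"Unrecognized type {wire_type}")
    return wire_type, id_delta
```

Reading from a region of zero bytes, for example `read_union_field_header(bytes(8), 0)`, fails in this model, which is consistent with the error being a symptom of damaged data rather than of the writer emitting a malformed union.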

    Andy

On 29/03/2023 08:06, Osma Suominen wrote:

Hi all,
we've recently rebuilt our two main virtual servers running Fuseki, which are the backend databases of the Finto.fi vocabulary service. After running the new servers for a few weeks we've already seen two cases, one on each server, where some of the SPARQL queries start failing.
In our setup, Fuseki is managing a relatively large TDB2 database along with a jena-text index. We keep the data in around 50 named graphs (one per vocabulary) and each graph is typically updated using s-put, replacing the whole graph in-place. When all data is initially loaded, the database directory takes around 25GB on one server and 44GB on the other. TDB2 tends to keep growing over time though, so around once a month we delete the whole database and rebuild it from RDF source files that we keep under version control; the master data doesn't reside within Fuseki.
When queries start failing, the Fuseki logs show long tracebacks, but the beef seems to be these two exceptions:
org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read
Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type 0
I've put the whole tracebacks into a gist [1]: there is one traceback for a failed SELECT query and another for a failed update query.
In both cases, the database size had grown to over 100GB when this happened. A restart of Fuseki didn't help, but rebuilding the whole database made the problem go away - for now. But this is making me worried that it could happen again any time. I'm sorry that I don't have an easy way of reproducing the problem at this time, as it only seems to happen after Fuseki has been running for a few weeks and done many different update operations on the TDB2 dataset.
I searched online for similar issues and exceptions. I could only find one gist [2] showing a very similar traceback, which apparently happened when running a compact operation on Fuseki 4.3.1. There was also a similar issue [3] reported in Apache Impala, and there the problem was fixed by adding more careful checks into the code writing Thrift node metadata. But that code is written in C++, so it's a bit hard to compare that to the Jena codebase.
The Impala issue says: "Since IMPALA-1048 we write TRuntimeProfileNode.node_metadata unconditionally, even when both its fields are unset. This trips up the Thrift library Java reader code, which expects to find exactly one type of a union to be set." Is it possible that Jena is similarly careless when writing Thrift metadata?
This happened with Fuseki version 4.6.1, since we did the install just before the 4.7.0 release. I've just upgraded one of the machines to 4.7.0 to see if it makes a difference. I can see that libthrift was updated from 0.16.0 to 0.17.0 in PR #1570, which happened in between the two Jena releases. It's possible that the problem has already been fixed there. In that case, I'm really sorry for the noise.
Is there anything I could do to help debug the problem? For now I will just keep monitoring the Fuseki instances to see if this happens again, especially with the new version.
Information about the setup:

OS: Rocky Linux 9.1 (RHEL based)
Kernel/arch: 5.14.0 x86_64
Java: openjdk version "11.0.18" 2023-01-17 LTS

Cheers,
Osma


[1] https://gist.github.com/osma/d61281160e84ea74e9d7dbc155ffaf69

[2] https://gist.github.com/jeffreycwitt/e7c270aae46f403845c87aa57e4b82af

[3] https://issues.apache.org/jira/browse/IMPALA-8252
