Re: Thrift problem / corruption on large TDB2 Fuseki dataset

Andy Seaborne Wed, 29 Mar 2023 04:31:52 -0700



On 29/03/2023 10:52, Osma Suominen wrote:

Hi Andy,

thanks for your quick response!

Andy Seaborne wrote on 29.3.2023 at 12.20:
Previous reports about this have involved hitting disk limits, other OS processes touching the files (including via a shared file system), and I/O errors: external environmental factors that happen silently, a significant time before the problem emerges.
Unfortunately, reports don't always get completed: there's a report, they try some things out, and we don't hear anything more. We don't get a picture of what actually happened or what worked.
I understand that these kinds of intermittent problems can be hard to debug and the cause can be an external factor. It's possible that this happened in our case as well. The machines are virtual servers running under VMware and they have their own XFS file systems based on LVM on (virtual) block devices. In my understanding there is nothing other than Fuseki itself that could be performing write operations on the Fuseki database files. The disks have never been full. This happened on two separate (though very similar) machines, a few days apart.
The one Jena-related issue was compaction in the presence of updates.

Compact got significant robustness improvements at 4.6.x.

https://github.com/apache/jena/issues/1252
https://github.com/apache/jena/pull/1456
It should be safe to compact an online database. Note that a compact is a "write" operation, so while the compact is running, concurrent writers are held up. Outstanding concurrent readers can continue, and new concurrent readers can start during compaction.
Good to know! We do not currently use the compact functionality in Fuseki, so I don't think it can be a factor in this.
Anything is possible, but Jena's use of Thrift is Java-only and Thrift enforces the union-defined assumption.
The "type 0" means it is reading something corrupted at a lower level.
Union is used for all RDF terms. Unless you have node extensions (which need Java code), this is code that is executed a lot.
https://github.com/apache/jena/blob/16c9a8295d78a19787bdaa05b359af97ba00dcab/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift#L68
We are using stock Apache Jena Fuseki builds. Nothing very customized except for some moderately complex jena-text configuration.
In my understanding Thrift is an RPC framework. I'm not sure I understand very well how it is used within Jena, when handling regular SPARQL queries coming in via HTTP to Fuseki. Are Thrift objects stored in TDB2? (The problem seemed to persist across Fuseki restarts.)
Basically I'm wondering how it's possible that "Thrift enforces the union-defined assumption" but still there was a Thrift object that apparently didn't follow it. How was it created? Or was it created, serialized to disk, somehow corrupted on-disk and then read back?

I don't think it is union-related. Anything broken is going to look like a union. All RDF terms are unions. I think it's looking into the middle of a messed-up set of bytes for a term.

But if it is a broken union, it would be possible to write a short Java program that takes a node, serializes it and then can't deserialize it. (I think all cases are in the test suite.) That's all deterministic.
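
A minimal sketch of that kind of round-trip check, for illustration only. It uses a toy tagged-union encoding in plain Java, not Jena's RDF Thrift classes or Thrift's real wire format; the class name, the tag values and the byte layout are all assumptions made up for the example. It shows that a round trip on intact bytes is deterministic, and that decoding from the middle of a record can surface as an unknown union tag such as 0.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ToyTermRoundTrip {
    // Hypothetical union tags for the toy format; 0 is never written by encode().
    public static final byte IRI = 1;
    public static final byte BNODE = 2;
    public static final byte LITERAL = 3;

    /** A toy RDF term: one union tag plus one string value. */
    public static final class Term {
        public final byte tag;
        public final String value;
        public Term(byte tag, String value) { this.tag = tag; this.value = value; }
    }

    /** Toy layout: 1 tag byte, 4-byte big-endian length, UTF-8 bytes. */
    public static byte[] encode(Term t) {
        byte[] body = t.value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + 4 + body.length)
                .put(t.tag).putInt(body.length).put(body).array();
    }

    /** Decode one term starting at offset, rejecting tags that are not union members. */
    public static Term decode(byte[] bytes, int offset) {
        ByteBuffer in = ByteBuffer.wrap(bytes, offset, bytes.length - offset);
        byte tag = in.get();
        if (tag < IRI || tag > LITERAL)
            throw new IllegalStateException("Unknown union tag: " + tag);
        int len = in.getInt();
        if (len < 0 || len > in.remaining())
            throw new IllegalStateException("Bad length: " + len);
        byte[] body = new byte[len];
        in.get(body);
        return new Term(tag, new String(body, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Term original = new Term(IRI, "http://example/s");
        byte[] bytes = encode(original);

        // Intact bytes: the round trip gives back the same term every time.
        Term back = decode(bytes, 0);
        System.out.println(back.tag == original.tag && back.value.equals(original.value)); // true

        // Misaligned read: offset 1 is the high byte of the length field, which is 0.
        try {
            decode(bytes, 1);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Unknown union tag: 0
        }
    }
}
```

In this toy format the writer can never produce tag 0, so seeing it on read means the reader started at the wrong place or the bytes changed underneath it, not that serialization itself is broken.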

The reports are for occasional, random errors that can't be reproduced after a rebuild.


    Andy


-Osma
