On 29/03/2023 10:52, Osma Suominen wrote:
Hi Andy,
thanks for your quick response!
Andy Seaborne wrote on 29.3.2023 at 12.20:
Previous reports about this have been hitting disk limits, other
OS processes touching the files (including on a shared file system),
and I/O errors. These are external environment factors that happen
silently a significant time before the problem emerges.
Unfortunately, reports don't always get completed - there's a report,
they try some things out, we don't hear anything more. We don't get a
picture of what actually happened or what worked.
I understand that these kinds of intermittent problems can be hard to
debug and the cause can be an external factor. It's possible that this
happened in our case as well. The machines are virtual servers running
under VMware and they have their own XFS file systems based on LVM on
(virtual) block devices. In my understanding there is nothing other than
Fuseki itself that could be performing write operations on the Fuseki
database files. The disks have never been full. This happened on two
separate (though very similar) machines, a few days apart.
The one Jena-related issue was compaction in the presence of updates.
Compact got significant robustness improvements in 4.6.x.
https://github.com/apache/jena/issues/1252
https://github.com/apache/jena/pull/1456
It should be safe to compact an online database. Note that a
compact is a "write" operation, so while the compact is running
concurrent writers are held up. Outstanding concurrent readers can
continue, and new concurrent readers can start during compaction.
Good to know! We do not currently use the compact functionality in
Fuseki, so I don't think it can be a factor in this.
Anything is possible, but Jena's use of Thrift is Java-only and Thrift
enforces the union-defined assumption.
The "type 0" means it is reading something corrupted at a lower level.
Union is used for all RDF terms. Unless you have node extensions
(needs Java code), this is code that is executed a lot.
https://github.com/apache/jena/blob/16c9a8295d78a19787bdaa05b359af97ba00dcab/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift#L68
We are using stock Apache Jena Fuseki builds. Nothing very customized
except for some moderately complex jena-text configuration.
In my understanding Thrift is an RPC framework. I'm not sure I
understand very well how it is used within Jena, when handling regular
SPARQL queries coming in via HTTP to Fuseki. Are Thrift objects stored
in TDB2? (The problem seemed to persist across Fuseki restarts.)
Basically I'm wondering how it's possible that "Thrift enforces the
union-defined assumption" but still there was a Thrift object that
apparently didn't follow it. How was it created? Or was it created,
serialized to disk, somehow corrupted on-disk and then read back?
I don't think it is union-related. Anything broken is going to look like
a union. All RDF terms are unions. I think it's looking into the middle
of a messed up set of bytes for a term.
But if it is a broken union, it would be possible to write a short Java
program that takes a node, serializes it and then can't deserialise it.
(I think all cases are in the test suite.) That's all deterministic.
The reports are for occasional, random errors that can't be reproduced
after a rebuild.
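To make that distinction concrete, here is a minimal sketch of such a
round-trip check. It uses a toy tagged-union encoding, not Jena's
RDF_Term or the real Thrift wire format: the class name TermRoundTrip,
the tag values and the byte layout are all made up for illustration.
The only point is that serialize-then-read is deterministic, while
bytes damaged underneath the serializer show up as an unset "type 0".

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy stand-in for a tagged-union term encoding (hypothetical; this is
// not Jena's RDF_Term and not the Thrift wire format).
final class TermRoundTrip {
    // Tag 0 is deliberately never written, so reading it means bad bytes.
    static final byte TAG_IRI = 1;
    static final byte TAG_LITERAL = 2;

    // Layout: 1 tag byte, 4-byte big-endian length, UTF-8 payload.
    static byte[] encode(byte tag, String value) {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(5 + payload.length)
                .put(tag)
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    // Returns "tag:value", or throws if the bytes are not a valid term.
    static String decode(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        if (in.remaining() < 5) {
            throw new IllegalStateException("truncated term");
        }
        byte tag = in.get();
        if (tag != TAG_IRI && tag != TAG_LITERAL) {
            throw new IllegalStateException("unknown term type " + tag);
        }
        int length = in.getInt();
        if (length != in.remaining()) {
            throw new IllegalStateException("bad payload length " + length);
        }
        byte[] payload = new byte[length];
        in.get(payload);
        return tag + ":" + new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Deterministic case: what was written is what is read back.
        byte[] good = encode(TAG_IRI, "http://example/s");
        System.out.println(decode(good));

        // Simulated damage below the serializer: the header is zeroed,
        // as if the reader landed in the middle of overwritten bytes.
        byte[] damaged = good.clone();
        Arrays.fill(damaged, 0, 5, (byte) 0);
        try {
            decode(damaged);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the writer can never produce tag 0, so the second case
can only come from the bytes changing after they were written, which
matches failures that are occasional and disappear after a rebuild.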
Andy
-Osma