I have been looking into reproducing the error locally, but haven't been able to, as the LOAD commands that produced the error a couple of months ago now kill my Rancher. With a lot of restarts and Rancher configuration changes (Apple virtualization instead of QEMU, and virtiofs volume mounts instead of the default one) I was able to get the LOADs working again. This was with Jena 4.9.0, and the LOADs no longer produced the "unrecognized type 0" error... I was even able to issue more and bigger LOAD commands than before in 4.7.0.
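For anyone wanting to retry this, a run of sequential LOADs can be sketched roughly as below. This is a minimal sketch, not the actual loading code used here: it assumes Fuseki's update service is reachable at a hypothetical `http://localhost:3030/ds/update`, that the data is reachable through URLs the server can dereference, and it sends each LOAD as a form-encoded SPARQL 1.1 Protocol update. The names `UPDATE_ENDPOINT`, `build_load_request` and `run_loads` are made up for illustration.

```python
import time
import urllib.parse
import urllib.request

# Hypothetical endpoint: adjust host, port and dataset name to the local setup.
UPDATE_ENDPOINT = "http://localhost:3030/ds/update"


def build_load_request(endpoint, source_url, graph=None):
    """Build a POST request carrying a single SPARQL Update LOAD."""
    update = f"LOAD <{source_url}>"
    if graph is not None:
        update += f" INTO GRAPH <{graph}>"
    # SPARQL 1.1 Protocol: update text in the form-encoded 'update' parameter.
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def run_loads(endpoint, source_urls, opener=urllib.request.urlopen):
    """Issue the LOADs one after another; return (url, seconds) per LOAD."""
    timings = []
    for url in source_urls:
        start = time.monotonic()
        # urlopen raises urllib.error.HTTPError on a 4xx/5xx response,
        # so a failing LOAD stops the run at the file that triggered it.
        with opener(build_load_request(endpoint, url)) as response:
            response.read()
        timings.append((url, time.monotonic() - start))
    return timings
```

Calling `run_loads(UPDATE_ENDPOINT, urls)` with a list of 20 to 25 file URLs and printing the returned timings would show both whether a LOAD fails and whether the completion time keeps climbing from one LOAD to the next.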
So after getting that working, but not being able to reproduce the error, I decided to try it in Jena 4.7.0... but there again my Rancher started failing when trying to do larger LOADs. So I wasn't able to reproduce the "unrecognized type 0" locally on either version. We did, however, get it a bunch of times on our TST environment in the last week (in a bunch of different scenarios). So it is definitely still occurring, also for datasets that were created in 4.8.0 or later. Might it be a good idea to delete/recreate all the datasets on the instance and see if it happens again?

I also had a further chat with our OPS people to check if they have any ideas about other processes that might be accessing Jena's files. The only things we could come up with were:

- the EFS we're using uses encryption at rest
- we're not doing backups ourselves, but whatever EFS does for backup-related stuff is being used
- we're running a daily compact command to free up disk space... but that is an API call that we guess shouldn't be an issue?

So we're still at a bit of a loss how and why this is happening.

On Tue, 19 Sept 2023 at 23:05, Andy Seaborne <[email protected]> wrote:

> Hi Jan,
>
> Thanks for the update.
>
> On 18/09/2023 19:49, Jan Eerdekens wrote:
> > Hi Andy,
> >
> > Sorry for the late answer, but I was quite busy.
> >
> > The database was as far as I can tell generated in version 4.7.0 and then
> > upgrades to 4.8.0 and 4.9.0 were done. Datasets were created (and some
> > deleted and created again) in all these versions.
> >
> > The scenario that my colleague had currently isn't reproducible after he
> > deleted and created his dataset again. I'd have to retry the data loads for
> > my load test scenario and see if that still triggers the issue (during the
> > load tests many months ago that was a pretty simple scenario that always
> > ended in the error - but that definitely was done on version 4.7.0).
> > I'll try to execute that loading code again and see what happens and open a
> > GitHub issue if it is able to reliably produce the issue in 4.9.0.
> >
> > We are running Jena in a k8s cluster on AWS and it uses EFS as a file
> > store.
>
> In case it matters, EFS is not the fastest storage for a database.
> Caching tends to hide this if the caches are holding enough of the
> working set but the latency is quite high.
>
> > As far as I know we don't have anything configured ourselves that
> > would cause concurrent access, but I'll check with our OPS people to see if
> > they can identify something on the OS level that might access the files or
> > if they have set up a backup process. Currently we're only running 1 Jena
> > instance per environment.
> >
> > regards,
> >
> > Jan
> >
> > On Wed, 30 Aug 2023 at 23:08, Andy Seaborne <[email protected]> wrote:
> >
> >> Hi Jan,
> >>
> >> On 30/08/2023 14:58, Jan Eerdekens wrote:
> >>> Hi,
> >>>
> >>> We've been evaluating and using Jena for about 1.5 years now, but are
> >>> recently running into a perplexing issue. In a lot of different scenarios,
> >>> ways of using Jena, we are getting the exceptions like the one below:
> >>>
> >>> Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type 0
> >>>     at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:140) ~[fuseki-server.jar:4.8.0]
> >>>
> >>> The different scenarios where it has happened are:
> >>>
> >>> - LOADing data into a dataset
> >>> - compacting a dataset
> >>> - querying a dataset
> >>>
> >>> In all those cases we've run into trouble and get an exception that
> >>> mentions *org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read* and
> >>> *org.apache.thrift.protocol.TProtocolException: Unrecognized type 0*.
> >>>
> >>> What can cause this?
> >>> This looks kinda similar to this mailing list question,
> >>> https://www.mail-archive.com/[email protected]/msg20409.html,
> >>> where it seems data corruption is mentioned that potentially isn't
> >>> recoverable?
> >>>
> >>> The first time I encountered this issue was while doing a bunch of
> >>> sequential LOAD commands to prepare a large dataset for load testing. I
> >>> used files of around 50 MB (started off with bigger ones) and after about 20
> >>> to 25 LOADs it would get this error (also the completion time of a LOAD
> >>> would go up and up). So for this scenario I was running locally (Jena
> >>> Fuseki running in docker/Rancher) and only running the LOADs and not much
> >>> else except for a SELECT here and there (via the Fuseki UI) to check that
> >>> performance while LOADing. Is there a way that that could cause data
> >>> corruption and the exception we're seeing?
> >>
> >> "Unrecognized type 0" has come up in a couple of cases.
> >>
> >> It means the node table is corrupt but the problem was caused silently
> >> at some point in the past. The "Unrecognized type 0" exception happens
> >> some time later (not a few seconds - either after a restart or a long
> >> time of usage that has churned the node cache - possibly many months).
> >>
> >> There have been some fixes around compaction that addressed bugs in this
> >> area. This has been the most common problem.
> >>
> >> Was this database originally created before 4.8.0?
> >>
> >> If not, do you have a fixed scenario so that the situation can be
> >> recreated for 4.9.0? Please raise a GitHub issue for it.
> >>
> >> Another situation is if another OS process interferes with the files
> >> (container OS or host OS). What operating system is the host machine?
> >>
> >> While TDB2 endeavours to protect against multiple copies of TDB running
> >> the same files, that is imperfect if it is two containers and the
> >> database is on a mounted docker volume used by two containers.
> >>
> >> One other report seemed to be a backup process was running over the
> >> files. We didn't get to the root cause of that one.
> >>
> >> Andy
> >>
> >>>
> >>> regards,
> >>>
> >>> Jan Eerdekens
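
One way to look into the "another OS process interferes with the files" possibility raised in the thread is to list which local processes hold file descriptors under the database directory. The following is only a sketch under stated assumptions: a Linux host or container with `/proc` mounted, enough privileges to read other processes' `fd` directories, and a hypothetical helper name `find_open_handles`. It sees only processes in the same PID namespace, so it would not reveal anything EFS does on the service side or anything running in another container; files that are only memory-mapped show up in `/proc/<pid>/maps` rather than under `fd`, which this sketch does not read.

```python
import os


def find_open_handles(db_dir, proc_root="/proc"):
    """Map PID -> sorted list of files under db_dir that the process has open."""
    db_dir = os.path.realpath(db_dir)
    prefix = db_dir + os.sep
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries in /proc are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process has exited, or we lack permission to look
        for fd in fds:
            try:
                # Each fd entry is a symlink to the file the process has open.
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if target == db_dir or target.startswith(prefix):
                holders.setdefault(int(entry), set()).add(target)
    return {pid: sorted(paths) for pid, paths in holders.items()}
```

Run inside the Fuseki pod against the directory holding the TDB2 databases, the expected result would be a single PID (the Fuseki JVM). Any additional PID would be a candidate for the kind of interference described in the thread.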
