On 28/04/2021 22:36, Brandon Sara wrote:
My Setup:
I’m running a few Fuseki servers via Docker containers. I need the storage to
be persistent across container restarts, so I’m using TDB2 for my storage. The
TDB2 databases are stored on a volume that is mounted to the Docker containers.
What is the storage for the database? EBS disk? EBS SSD?
This volume is part of our S3 instance. The Fuseki servers’ individual DBs are
kept in sync using RDF-Delta. The dataset in question uses full-text search
via jena-text (Lucene) with two properties being indexed (though they occur
often in the dataset). The reasoner being used is `TransitiveReasoner`. I have
only one default graph and no other graphs.
My Problem:
Uploading ~10 MB of data (in a ttl file) sometimes takes more than 3 hours to
complete! We tried turning off full-text search and it cut the time roughly in
half.
OK - so indexing is costing 1.5 hours, which is a long time and suggests
the storage is very slow. What is the Lucene index stored on? The same
storage as the TDB2 database?
If it is a single file, the S3-write is going to be a single commit and
a single S3 block. S3 isn't a filesystem but
But still, 1.5 hours for only 10 MB of triple data is waaaay too long. Does
anyone have any ideas of how we could fix this issue (other than the obvious
one of not using a network-connected disk)?
10 MB is how many triples? And how many are indexed into Lucene?
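
To show why the triple count matters, here is a rough back-of-the-envelope
sketch that turns the reported figures into a load rate. The file size and
times are taken from the report above; the triple count is a made-up
placeholder (HYPOTHETICAL_TRIPLES), since the real number has not been given,
and should be replaced with the actual count.

```python
def throughput(size_bytes, elapsed_seconds, triples=None):
    """Return load throughput as bytes/s and, if a triple count is given, triples/s."""
    result = {"bytes_per_s": size_bytes / elapsed_seconds}
    if triples is not None:
        result["triples_per_s"] = triples / elapsed_seconds
    return result


SIZE_BYTES = 10 * 1000 * 1000    # "~10 MB" from the report, taken as 10^7 bytes
FULL_RUN_S = 3 * 3600            # "more than 3 hours" with text indexing on
NO_TEXT_RUN_S = FULL_RUN_S / 2   # "cut the time in ~half" with text indexing off

# Placeholder only: the real triple count is unknown and must be measured.
HYPOTHETICAL_TRIPLES = 100_000

with_text = throughput(SIZE_BYTES, FULL_RUN_S, HYPOTHETICAL_TRIPLES)
without_text = throughput(SIZE_BYTES, NO_TEXT_RUN_S, HYPOTHETICAL_TRIPLES)

print("with text index:   ", with_text)
print("without text index:", without_text)
```

With the reported numbers this works out to under 1 KB/s with text indexing on
and under 2 KB/s with it off, whatever the triple count turns out to be.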
I think you'll need to experiment with simplified setups to see where
the time is going. This includes making sure the heap isn't doing a lot
of work.
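
One possible simplified experiment, sketched here under assumptions: time a
series of small synchronous writes in the directory where the database lives
and compare that with a local disk. This is only a rough proxy for the small
synced writes a transactional store tends to issue; it is not how TDB2 or
Lucene actually write. The function names and the default write count and
size are chosen purely for illustration.

```python
import os
import statistics
import tempfile
import time


def sync_write_latencies(directory, writes=50, size=4096):
    """Return per-write latencies (seconds) for small fsync'd writes in `directory`."""
    payload = os.urandom(size)
    latencies = []
    # Create the scratch file inside the directory under test so the writes
    # go to the same storage the database would use.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to push the data to the underlying storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def summarize(latencies):
    """Condense a list of latencies into a few headline numbers."""
    return {
        "writes": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        "total_s": sum(latencies),
    }
```

Calling summarize(sync_write_latencies(path)) once with the mounted volume's
path and once with a local directory gives a first indication of whether the
storage itself is the bottleneck.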
Andy
Thanks.