Hi Adrian,
I'd expect it to be faster, though there are a lot of environmental factors. Lots
of questions below ...
I just tried loading 25 million BSBM triples into a Fuseki server with TDB2
on my machine (32G RAM, SATA SSD) and it took 7 minutes (59K triples/s)
using s-post to send the single file. Loading into a named graph took
11m40s (37K triples/s). The heap was 2G.
I've loaded 200+million on the default setup into a live server using
TDB2 before.
On 29/06/2021 14:07, Adrian Gschwend wrote:
Hi everyone,
We have automated pipelines that write to Fuseki using the SPARQL Graph
Store protocol. This seems to work fine for smaller chunks of data but
when we write a larger dataset of around 15 million triples in one
batch, this seems to fail.
Details matter though ...
How is the data being sent? What's the client software?
Does the data have a lot of long literals?
Is it loading into the default graph or a named graph?
Is it "same machine" or are sender and server on different machines?
What does the log have in it? And what's in the log when running "verbose",
which prints more HTTP details?
Is the server also live, running queries?
Which form of Fuseki? The low level of HTTP is provided by the web
server - Jetty or Tomcat.
I presume this is not a setup where the dataset is nested in some other
functionality?
(and what's the version, though nothing has changed directly but you
never know... maybe a dependency)
If the data is available, I can try to run it at my end.
(And also in an emerging update of the GSP code including running on HTTP/2)
After checking out what happens, we see an OOM error.
TDB1:: There is a limitation on the single transaction as it requires
temporary heap space. With TDB1, sending chunks avoids the limit (unless
the server is under other load and can't find the time to flush the
transaction to the database from the journal).
TDB2:: There is no such limitation, nor is it affected by a concurrent
read load holding up freeing resources.
In fact, TDB2 does some of the work while the transaction is in-progress
that TDB1 does at the end.
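For the TDB1 case, sending chunks could look roughly like the sketch below. It is only an illustration, not what any particular client does: the endpoint URL, the batch size and the function names are assumptions, and it relies on N-Triples being one triple per line. Note that blank node labels are scoped to each request, so this only suits data where blank nodes are not shared across batches.

```python
import itertools
import urllib.request
from typing import Iterable, Iterator, List


def batches(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of up to `size` non-blank lines (one N-Triples statement per line)."""
    non_blank = (ln for ln in lines if ln.strip())
    while True:
        batch = list(itertools.islice(non_blank, size))
        if not batch:
            return
        yield batch


def post_batch(endpoint: str, batch: List[str]) -> int:
    """POST one batch to a Graph Store Protocol endpoint; POST adds to the target graph."""
    body = "".join(ln if ln.endswith("\n") else ln + "\n" for ln in batch)
    req = urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/n-triples"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def upload(path: str, endpoint: str, size: int = 100_000) -> None:
    """Read the file lazily and send it as a series of separate requests."""
    with open(path, encoding="utf-8") as f:
        for batch in batches(f, size):
            post_batch(endpoint, batch)


# Hypothetical usage; the dataset name "ds" and the file name are placeholders:
# upload("data.nt", "http://localhost:3030/ds/data?default")
```

Each request is its own transaction on the server, which is why this keeps the per-transaction heap use down on TDB1.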
We send application/n-triples so I was expecting that it streams it.
Yes, if it can.
When using tdbloader this size is not really an issue at all.
With tdbloader (TDB1), loading into an empty database is handled differently
from loading into a non-empty one.
Loading in Fuseki is not full "tdbloader" in either TDB1 or TDB2.
In this particular setup we first used TDB, the machine has 6GB of
memory assigned.
As in -Xmx6G on what size of machine? If 8G-ish, it's going to suffer
from a lack of space in the disk cache. A 2G heap is likely fine.
Is the storage spinning disk or SSD?
TDB2 seems to behave a bit better, it runs through without OOM but takes
1.5 hours for the job while it is less than 15 minutes when we split it
into smaller chunks and send it in batches of ~100k triples via Graph Store
Protocol.
That is a bit slow.
Interestingly we never see more than 1GB of RAM used so I'm even more
confused.
TDB2 uses space in two ways - the node table cache and the indexes.
The indexes are not cached in the heap. They are cached by the OS in the
file system cache and accessed by memory mapping.
If that space is squeezed by Java growing the heap, it can become slow.
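As a rough back-of-the-envelope illustration of that squeeze: whatever the JVM heap takes is not available to the OS file system cache. The function name and the 1G allowance for the OS and everything else are assumptions made up for this sketch, not measured values.

```python
def file_cache_headroom_gb(machine_ram_gb: float, jvm_heap_gb: float,
                           other_gb: float = 1.0) -> float:
    """Estimate RAM left for the OS file system cache.

    other_gb is an assumed allowance for the OS and other processes.
    """
    return max(0.0, machine_ram_gb - jvm_heap_gb - other_gb)


# An 8G machine with a 6G heap leaves very little for the memory-mapped indexes.
print(file_cache_headroom_gb(8, 6))  # 1.0
# The same machine with a 2G heap leaves far more.
print(file_cache_headroom_gb(8, 2))  # 5.0
```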
Is this OOM error to be expected for large graph-store writes?
TDB1 - yes.
Also, if the server is in use for reads, the read load can block TDB1
from doing some of the finalization, and that keeps data in memory
longer (as well as safe in the journal).
TDB2 - there's a reason why it is not TDB1 :-)
Andy
regards
Adrian