Hi Adrian,

I'd expect it to be faster, though there are a lot of environmental factors. Lots of questions below ...

I just tried loading 25 million BSBM triples into a Fuseki server with TDB2 on my machine (32G RAM, SATA SSD). It took 7 minutes (59K triples/s) using s-post to send the single file, and 11m40s when loading into a named graph (37K triples/s). The heap was 2G.
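For reference, a single-file upload of this kind can be sketched in Python. This is a minimal sketch and not how s-post itself is implemented: the helper names are made up for illustration, and it assumes the server accepts chunked request bodies. The `?default` and `?graph=` query parameters are the ones defined by the SPARQL Graph Store Protocol.

```python
import urllib.parse
import urllib.request


def gsp_target(endpoint, graph=None):
    """Build a Graph Store Protocol URL for the default or a named graph."""
    if graph is None:
        return endpoint + "?default"
    return endpoint + "?" + urllib.parse.urlencode({"graph": graph})


def post_file(endpoint, path, graph=None, content_type="application/n-triples"):
    """POST a file to the graph store, streaming it from disk.

    Passing an open file object (with no Content-Length header) makes
    urllib send the body with chunked transfer encoding, so the file is
    not read into client memory first.
    """
    with open(path, "rb") as body:
        request = urllib.request.Request(
            gsp_target(endpoint, graph),
            data=body,
            method="POST",
            headers={"Content-Type": content_type},
        )
        with urllib.request.urlopen(request) as response:
            return response.status
```

A call might look like `post_file("http://localhost:3030/ds/data", "data.nt", graph="http://example.org/g")`, where the endpoint, file name and graph URI are all placeholders for the real setup.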

I've loaded 200+million on the default setup into a live server using TDB2 before.

On 29/06/2021 14:07, Adrian Gschwend wrote:
Hi everyone,

We have automated pipelines that write to Fuseki using the SPARQL Graph
Store protocol. This seems to work fine for smaller chunks of data but
when we write a larger dataset of around 15 million triples in one
batch, this seems to fail.

Details matter though ...

How is the data being sent? What's the client software?
Does the data have a lot of long literals?

Is it loading into the default graph or a named graph?

Is it "same machine" or are sender and server on different machines?

What does the log have in it? And what's in the log when running "verbose", which prints more HTTP details?

Is the server also live, running queries?

Which form of Fuseki? The low level of HTTP is provided by the web server - Jetty or Tomcat.

I presume this is not a setup where the dataset is nested inside some other functionality?

(and what's the version, though nothing has changed directly but you never know... maybe a dependency)

If the data is available, I can try to run it at my end.

(And I can also try it with an emerging update of the GSP code, including running on HTTP/2.)

After checking out what happens, we see an OOM error.

TDB1:: There is a limit on the size of a single transaction because it requires temporary heap space. With TDB1, sending chunks avoids the limit (unless the server is under other load and can't find the time to flush the transaction from the journal to the database).
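Sending chunks can be sketched as below. This is a minimal sketch under assumptions: the data is N-Triples, `send` is a hypothetical caller-supplied function that POSTs one payload to the graph store, and the 100,000-line batch size is only illustrative.

```python
from itertools import islice


def batches(lines, size=100_000):
    """Yield lists of at most `size` lines from any iterable of lines.

    N-Triples is line-based, so splitting on line boundaries keeps
    every batch a valid document.
    """
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def upload_in_batches(path, send, size=100_000):
    """Read `path` lazily and pass each batch to `send` as bytes.

    `send` is a caller-supplied function (hypothetical here) that POSTs
    one payload to the graph store, so each batch is its own request.
    Returns the number of batches sent.
    """
    count = 0
    with open(path, "rb") as source:
        for batch in batches(source, size):
            send(b"".join(batch))
            count += 1
    return count
```

One caveat with this approach: blank node labels are scoped to a single request, so data in which the same blank node appears in more than one batch would not survive the split unchanged.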

TDB2:: There is no such limitation, nor is it affected by a concurrent read load holding up the freeing of resources.

In fact, TDB2 does some of the work while the transaction is in-progress that TDB1 does at the end.

We send application/n-triples so I was expecting that it streams it.

Yes, if it can.

When using tdbloader this size is not really an issue at all.

With tdbloader (TDB1), loading into an empty database is handled differently from loading into a non-empty one.

Loading in Fuseki is not full "tdbloader" in either TDB1 or TDB2.


In this particular setup we first used TDB, the machine has 6GB of
memory assigned.

As in -Xmx6G on what size of machine? If 8G-ish, it's going to suffer from a lack of space for the disk cache. 2G is likely fine.

Is the storage spinning disk or SSD?

TDB2 seems to behave a bit better, it runs through without OOM but takes
1.5 hours for the job while it is less than 15 minutes when we split it
into smaller chunks and send it in ~100k triples batches via Graph Store
Protocol.

That is a bit slow.


Interestingly we never see more than 1GB of RAM used so I'm even more
confused.

TDB2 uses space in two ways - the node table cache and the indexes.

The indexes are not cached in the heap. They are cached by the OS in the file system cache and accessed by memory mapping.

If that space is squeezed by Java growing the heap, it can become slow.
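As a back-of-the-envelope illustration of that squeeze, here is a rough sketch. The 1G allowance for the OS and non-heap JVM memory is an assumption, not a measured figure, and the function name is made up.

```python
def file_cache_budget_gb(machine_ram_gb, java_heap_gb, other_gb=1.0):
    """Rough upper bound on RAM left for the OS file system cache.

    `other_gb` is an assumed allowance for the OS and non-heap JVM
    memory; the real figure varies by system.
    """
    return max(machine_ram_gb - java_heap_gb - other_gb, 0.0)


# An 8G machine with -Xmx6G leaves very little for the memory-mapped indexes ...
print(file_cache_budget_gb(8, 6))  # 1.0
# ... while a 2G heap on the same machine leaves most of the RAM to the cache.
print(file_cache_budget_gb(8, 2))  # 5.0
```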


Is this OOM error to be expected for large graph-store writes?

TDB1 - yes.

Also, if the server is in use for reads, the read load can block TDB1 from doing some of the finalization, and that keeps data in memory longer (as well as safely in the journal).

TDB2 - there's a reason why it is not TDB1 :-)

    Andy



regards

Adrian
