Re: Fuseki Graph Store Protocol: Streaming or not?

Andy Seaborne Tue, 29 Jun 2021 11:30:03 -0700

Hi Adrian,

I'd expect faster though there are a lot of environmental factors. Lots of questions below ...

I just tried loading 25 million BSBM triples into a Fuseki server with TDB2 on my machine (32G RAM, SATA SSD) and it took 7 minutes (59 ktriples/s) using s-post to send the single file. 11m40s for a named graph (37 ktriples/s). 2G heap.

I've loaded 200+ million on the default setup into a live server using TDB2 before.


On 29/06/2021 14:07, Adrian Gschwend wrote:

Hi everyone,

We have automated pipelines that write to Fuseki using the SPARQL Graph
Store protocol. This seems to work fine for smaller chunks of data but
when we write a larger dataset of around 15 million triples in one
batch, this seems to fail.


Details matter though ...

How is the data being sent? What's the client software?
Does the data have a lot of long literals?

Is it loading into the default graph or a named graph?

Is it "same machine" or are sender and server on different machines?

What does the log have in it? And what's in the log if running "verbose", which prints more HTTP details?


Is the server also live, running queries?

Which form of Fuseki? The low level of HTTP is provided by the webserver - Jetty or Tomcat.

I presume this is not a setup where the dataset is nested in some other functionality?

(and what's the version, though nothing has changed directly but you never know... maybe a dependency)


If the data is available, I can try to run it at my end.

(And also in an emerging update of the GSP code including running on HTTP/2)

After checking out what happens, we see an OOM error.

TDB1:: There is a limitation on the single transaction as it requires temporary heap space. With TDB1, sending chunks avoids the limit (unless the server is under other load and can't find the time to flush the transaction to the database from the journal).

TDB2:: There is no such limitation, nor is it affected by a concurrent read load holding up freeing resources.

In fact, TDB2 does some of the work while the transaction is in progress that TDB1 does at the end.

We send application/n-triples so I was expecting that it streams it.


Yes, if it can.
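
Streaming also depends on the client not buffering the whole file before sending. As a rough sketch of a client that hands the file to the HTTP layer as a stream: the endpoint `http://localhost:3030/ds/data` and the function names here are illustrative assumptions, not taken from the setup being discussed. The `?default` and `?graph=` query forms are the ones defined by the SPARQL 1.1 Graph Store HTTP Protocol.

```python
import urllib.request
from urllib.parse import quote


def gsp_url(endpoint, graph=None):
    """Build a Graph Store Protocol target: ?default or ?graph=<uri>."""
    if graph is None:
        return endpoint + "?default"
    # Percent-encode the whole graph URI so it is safe as a query value.
    return endpoint + "?graph=" + quote(graph, safe="")


def post_file_streaming(path, endpoint, graph=None):
    """POST an N-Triples file without reading it all into client memory."""
    with open(path, "rb") as body:
        req = urllib.request.Request(
            gsp_url(endpoint, graph),
            # A file object with no Content-Length set is sent by urllib
            # using chunked transfer encoding, read piece by piece.
            data=body,
            method="POST",
            headers={"Content-Type": "application/n-triples"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status


# Hypothetical usage (file name and endpoint are placeholders):
# post_file_streaming("data.nt", "http://localhost:3030/ds/data")
# post_file_streaming("data.nt", "http://localhost:3030/ds/data",
#                     graph="http://example/g")
```

This only shows the sending side; whether the server can process the request as a stream is a separate matter.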

When using tdbloader this size is not really an issue at all.

With tdbloader (TDB1), loading into an empty database is handled differently from loading into a non-empty one.


Loading in Fuseki is not full "tdbloader" in either TDB1 or TDB2.


In this particular setup we first used TDB, the machine has 6GB of
memory assigned.

As in -Xmx6G on what size of machine? If 8G-ish, it's going to suffer from lack of space in the disk cache. A 2G heap is likely fine.


Is the storage spinning disk or SSD?

TDB2 seems to behave a bit better, it runs through without OOM but takes
1.5 hours for the job while it is less than 15 minutes when we split it
into smaller chunks and send it in ~100k triples batches via Graph Store
Protocol.


That is a bit slow.
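
For comparison, the batched approach described above might look roughly like the following. It is a sketch only: the function names, the `url` argument and the default batch size are placeholders, and it relies on N-Triples having one triple per line, so it counts lines (including any comment or blank lines) rather than parsed triples.

```python
import itertools
import urllib.request


def batches(lines, size):
    """Yield lists of at most `size` lines from any iterable of lines."""
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def post_in_batches(path, url, size=100_000):
    """POST an N-Triples file as a series of smaller requests.

    Each request is handled by the server as a separate write, so a
    failure part-way through leaves the earlier batches in place.
    """
    sent = 0
    with open(path, "rb") as f:
        for chunk in batches(f, size):
            req = urllib.request.Request(
                url,
                data=b"".join(chunk),
                method="POST",
                headers={"Content-Type": "application/n-triples"},
            )
            with urllib.request.urlopen(req):
                pass  # any 2xx response means the batch was accepted
            sent += len(chunk)
    return sent
```

One caveat with splitting: blank node labels are scoped to a single request, so a blank node that appears in two different batches ends up as two different nodes in the store.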


Interestingly we never see more than 1GB of RAM used so I'm even more
confused.


TDB2 uses space in two ways - the node table cache and the indexes.

The indexes are not cached in the heap. They are cached by the OS in the file system cache and accessed by memory mapping.


If that space is squeezed by Java growing the heap, it can become slow.


Is this OOM error to be expected for large graph-store writes?


TDB1 - yes.

Also, if the server is in use for reads, the read load can block TDB1 from doing some of the finalization, which keeps data in memory longer (as well as safely in the journal).


TDB2 - there's a reason why it is not TDB1 :-)

    Andy



regards

Adrian
