Hi Adrian,

(Fuseki version number?)

With your data and your script, I get load rates of:

Fuseki main
6m44.184s / 34k TPS (triples per second)

Fuseki in the form you used:
7m11.894s / 32k

(only one run each so these two are "the same")

which is about what I'd expect.

Datasets are "publish centric" (indexed for every access pattern), but that has an update cost.

So we seem to be down to the fact that, as one big file, the load takes 1.5 hours, but it is comparable to my figures when split into 100k chunks.

That I can't explain.

All I can think of is the larger-than-needed heap growing to take most of the machine and squeezing out the file system cache, causing a lot more real I/O.

Some notes inline ...

    Andy

On 29/06/2021 20:49, Adrian Gschwend wrote:
On 29.06.21 20:29, Andy Seaborne wrote:

Hi Andy,

I'd expect faster, though there are a lot of environmental factors. Lots
of questions below ...

good point

I've loaded 200+million on the default setup into a live server using
TDB2 before.

ok good to know. I started to have some doubts, that's why I asked.

How is the data being sent? What's the client software?

In this test it's pure curl to a named graph:

curl -X PUT \

It's a PUT which clears the destination first.

"Clear" is "delete all", not a fast path, because of current transactions.

      -n \
      -H Content-Type:application/n-triples \
      -T scope.nt \
      -G $SINK_ENDPOINT_URL \
      --data-urlencode graph=https://some-named-graph/graph/ais-metadata
Does the data have a lot of long literals?

What is "long" in that context? It's archival records so they indeed do
have longer literals, at least partially.

I looked at the start and didn't see anything of note.

"Long" means strings being hundreds of chars long.

Is it "same machine" or are sender and server on different machines?

different machine, endpoint is in a hosted kubernetes cluster. There is
obviously some overhead because of the line but it should not be a big
issue in this setup.

What does the log have in it? And what's in the log if running "verbose"
which prints more HTTP details.

What would be of interest here?

The server outputs a log file that records

when verbose it prints the headers:

10:36:04 INFO  Fuseki     :: [1] PUT http://localhost:3030/ds
10:36:04 INFO  Fuseki     :: [1]   => Accept:              */*
10:36:04 INFO  Fuseki     :: [1]   => Expect:              100-continue
10:36:04 INFO  Fuseki     :: [1]   => User-Agent:          curl/7.74.0
10:36:04 INFO  Fuseki     :: [1]   => Host:                localhost:3030
10:36:04 INFO  Fuseki     :: [1]   => Content-Length:      59
10:36:04 INFO  Fuseki     :: [1]   => Content-Type:        application/n-triples

10:36:28 INFO  Fuseki     :: [1] Body: Content-Length=59, Content-Type=application/n-triples, Charset=null => N-Triples : Count=1 Triples=1 Quads=0
10:36:28 INFO  Fuseki     :: [1]   <= Content-Type:        application/json
10:36:28 INFO  Fuseki     :: [1]   <= Content-Length:      61
10:36:28 INFO  Fuseki     :: [1]   <= Server:              Apache Jena Fuseki (4.2.0-SNAPSHOT)
10:36:28 INFO  Fuseki     :: [1] 200 OK (2.677 s)

(different data in this example)

but no matter, as I have an example working here now, provided there isn't a concurrent access load, which shows up as interleaved [] records.


But that reminds me of something else, it's a custom Fuseki version we
made with Open Telemetry integrated so we can get a lot more tracing:

https://github.com/zazuko/docker-fuseki-otel

404

Fuseki will output a regular NCSA log file of requests as well - by default it's off but with log4j2 you can set it to write to a file.

(we started doing that when having problems that were almost impossible
to debug otherwise as the final sender will also be somewhere else with
proxy & other stuff that makes debugging super hard).

Yes!


Is the server also live, running queries?

nothing extraordinary right now no, still early phase.

Which form of Fuseki? The low level of HTTP is provided by the web
server - Jetty or Tomcat.

The zip we use is taken from maven, it's
apache-jena-fuseki-${JENA_VERSION}.zip, not sure what this one is using?

Fuseki comes as:

WAR file for Tomcat etc.

Standalone server
from that zip - which is a "webapp" (it has a UI) with Jetty.

There is also a server only "Fuseki main"
https://repo1.maven.org/maven2/org/apache/jena/jena-fuseki-server/
which is Fuseki, no UI, not a webapp.

which is the same core engine doing the same thing.

Source:
https://github.com/zazuko/docker-fuseki-otel/blob/main/image/Dockerfile#L22

404

I presume this is not a setup where the dataset is nested in some other
functionality?

(and what's the version, though nothing has changed directly but you
never know... maybe a dependency)

good point after we added open telemetry I think we did not go back to
original fuseki with no modifications anymore.

If the configuration is layered, it has a cost.


If the data is available, I can try to run it at my end.

it is:

http://ktk.netlabs.org/misc/rdf/scope.nt.gz

Got it!

TDB1:: There is a limitation on the single transaction as it requires
temporary heap space. With TDB1, sending chunks avoids the limit (unless
the server is under other load and can't find the time to flush the
transaction to the database from the journal).

ok that is pretty much what we experienced. In other words in this setup
TDB1 will always have this limitation, good to know thanks.

TDB2:: There is no such limitation, nor is it affected by a concurrent
read load holding up freeing resources.

In fact, TDB2 does some of the work while the transaction is in-progress
that TDB1 does at the end.

excellent. With TDB1 we never managed to write everything without OOM,
with TDB2 it's slow but we could write the full batch.

We send application/n-triples so I was expecting that it streams it.

Yes, if it can.

ok

Loading in Fuseki is not full "tdbloader" in either TDB1 or TDB2.

ok I expected tdbloader "cheats" as it's super fast. Not a problem per
se obviously. We have another setup where we load with tdbloader and
then replace the instance in kubernetes. No outside writes allowed in
that setup.

k8s - always a chance the I/O path is slower than my unvirtualized test figures.

As in -Xmx6G on what size of machine? If 8G-ish, it's going to suffer
lack of space in the disk cache. 2G is likely fine.
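
To make the sizing point above concrete, here is a rough back-of-the-envelope sketch. It is only arithmetic: the helper name `cache_headroom_gb` is made up for illustration, and it ignores non-heap JVM memory and everything else running on the machine. The commented start-up line assumes the stock `fuseki-server` script, which reads `JVM_ARGS`; a custom container image may set the heap differently, and the dataset location and name shown are placeholders.

```shell
# RAM (in GB) left over for the OS file system cache once the Java heap
# has grown to its -Xmx limit. Deliberately crude: whole GB only.
cache_headroom_gb() {
    # $1 = machine RAM in GB, $2 = -Xmx in GB
    echo $(( $1 - $2 ))
}

# Hypothetical start-up with a smaller heap (placeholder location and name):
# JVM_ARGS=-Xmx2G ./fuseki-server --tdb2 --loc=/data/db --update /ds
```

With the figures from this thread, `cache_headroom_gb 8 6` leaves 2 GB for the cache, while `cache_headroom_gb 8 2` leaves 6 GB.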

ok will check with my devops colleagues, not sure.

Is the storage spinning disk or SSD?

same

TDB2 seems to behave a bit better, it runs through without OOM but takes
1.5 hours for the job while it is less than 15 minutes when we split it
into smaller chunks and send it in ~100k triples batches via Graph Store
Protocol.
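
For readers following along, the chunked upload described here can be sketched roughly as below. This is an illustration, not the script used in the thread: the function names, the chunk prefix and the use of POST for every chunk are assumptions. It relies on N-Triples being one triple per line, so a line-based split keeps each chunk parseable. One caveat: blank node labels are scoped to a single request, so blank nodes shared across a chunk boundary would end up as different nodes in the store.

```shell
# Split an N-Triples file into fixed-size chunks on line boundaries.
split_ntriples() {
    # $1 = input file, $2 = lines (triples) per chunk, $3 = output prefix
    split -l "$2" "$1" "$3"
}

# Send every chunk to one named graph via the Graph Store Protocol.
# POST adds to the graph; unlike PUT it does not clear it first.
upload_chunks() {
    # $1 = endpoint URL, $2 = graph URI, $3 = chunk prefix
    for chunk in "$3"*; do
        curl --fail -n \
             -X POST \
             -H 'Content-Type: application/n-triples' \
             -T "$chunk" \
             -G "$1" \
             --data-urlencode "graph=$2" || return 1
    done
}
```

Usage would be along the lines of `split_ntriples scope.nt 100000 chunk-` followed by `upload_chunks "$SINK_ENDPOINT_URL" https://some-named-graph/graph/ais-metadata chunk-`. If the graph has to be emptied first, the first chunk could be sent with PUT, as in the original command, and the rest with POST.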

That is a bit slow.

that was my feeling too.

If that space is squeezed by Java growing the heap, it can become slow.

ok will check the setup.

TDB2 - there's a reason why it is not TDB1 :-)

that is very good to know. So far we mainly used it in tdbloader setups
so apparently the issues with TDB1 were less a problem for our use-cases.

thanks for the feedback so far!

regards

Adrian
