On 29.06.21 20:29, Andy Seaborne wrote:
Hi Andy,
> I'd expect faster though there are a lot of environmental factors. Lots
> of question below ...
good point
> I've loaded 200+million on the default setup into a live server using
> TDB2 before.
ok good to know. I started to have some doubts, that's why I asked.
> How is the data being sent? What's the client software?
In this test it's pure curl to a named graph:
curl -X PUT \
  -n \
  -H Content-Type:application/n-triples \
  -T scope.nt \
  -G "$SINK_ENDPOINT_URL" \
  --data-urlencode graph=https://some-named-graph/graph/ais-metadata
> Does the data have a lot of long literals?
What is "long" in that context? These are archival records, so some of
them do indeed have longer literals.
> Is it "same machine" or are sender and server on different machines?
Different machines; the endpoint is in a hosted Kubernetes cluster. There
is obviously some network overhead, but it should not be a big issue in
this setup.
> What does the log have in it? And what's in the log if running "verbose"
> which prints more HTTP details.
What would be of interest here?
But that reminds me of something else: it's a custom Fuseki build we
made with OpenTelemetry integrated, so we can get a lot more tracing:
https://github.com/zazuko/docker-fuseki-otel
(We started doing that after running into problems that were almost
impossible to debug otherwise, since the final sender will also sit
somewhere else, behind a proxy and other things that make debugging
very hard.)
> If the server also live, running queries?
Nothing extraordinary right now, no; it's still an early phase.
> Which form of Fuseki? The low level of HTTP is provided by the web
> server - Jetty or Tomcat.
The zip we use is taken from Maven: apache-jena-fuseki-${JENA_VERSION}.zip.
I'm not sure which web server that one uses.
Source:
https://github.com/zazuko/docker-fuseki-otel/blob/main/image/Dockerfile#L22
> I presume this is not a set the dataset is nested in some other
> functionality?
>
> (and what's the version, though nothing has changed directly but you
> never know... maybe a dependency)
Good point. After we added OpenTelemetry, I don't think we ever went back
to the original, unmodified Fuseki.
> If the data is available, I can try to run it at my end.
it is:
http://ktk.netlabs.org/misc/rdf/scope.nt.gz
> TDB1:: There is a limitation on the single transaction as it requires
> temporary heap space. With TDB1, sending chunks avoids the limit (unless
> the server is under other load and can't find the time to flush the
> transaction to the database from the journal).
OK, that is pretty much what we experienced. In other words, in this
setup TDB1 will always have this limitation. Good to know, thanks.
> TDB2:: There is no such a limitation nor is it affected by a concurrent
> read load holding up freeing resources.
>
> In fact, TDB2 does some of the work while the transaction is in-progress
> that TDB1 does at the end.
excellent. With TDB1 we never managed to write everything without OOM,
with TDB2 it's slow but we could write the full batch.
>> We send application/n-triples so I was expecting that it streams it.
>
> Yes, if it can.
ok
> Loading in Fuseki is not full "tdbloader" in either TDB1 or TDB2.
ok I expected tdbloader "cheats" as it's super fast. Not a problem per
se obviously. We have another setup where we load with tdbloader and
then replace the instance in kubernetes. No outside writes allowed in
that setup.
> As in -Xmx6G on what size of machine? If 8G-ish, it's going to suffer
> lack of space in the disk cache. 2G is likely fine.
ok will check with my devops colleagues, not sure.
> Is the storage spinning disk or SSD?
Same here, I will have to check.
>> TDB2 seems to behave a bit better, it runs through without OOM but takes
>> 1.5 hours for the job while it is less than 15 minutes when we split it
>> into smaller chunks and send it in ~100k triples batches via Graph Store
>> Protocol.
>
> That is a bit slow.
that was my feeling too.
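For reference, a minimal sketch of that kind of batched upload, not a
definitive script. N-Triples carries one triple per line, so a plain
line-based split keeps every triple intact. The function names, the
chunk- prefix and the 4-letter suffix are made up for illustration, and
the combination of -X POST with -T is an assumption modelled on the curl
call above. POST appends to the named graph, whereas PUT would replace
it with each chunk. One caveat: blank node labels are scoped to a single
request, so data sharing blank nodes across chunks would not survive
this kind of split.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Split an N-Triples file into chunks of a fixed number of lines.
# Usage: split_ntriples FILE LINES_PER_CHUNK OUTPUT_DIR
split_ntriples() {
  # -a 4 gives 4-letter suffixes (chunk-aaaa, chunk-aaab, ...),
  # enough for large files; glob order then matches file order.
  mkdir -p "$3" && split -l "$2" -a 4 "$1" "$3/chunk-"
}

# Append every chunk to one named graph via the Graph Store Protocol.
# Usage: upload_chunks CHUNK_DIR ENDPOINT_URL GRAPH_IRI
upload_chunks() {
  for chunk in "$1"/chunk-*; do
    # Stop at the first failed request instead of silently continuing.
    curl --fail -X POST -n \
      -H Content-Type:application/n-triples \
      -T "$chunk" \
      -G "$2" \
      --data-urlencode "graph=$3" || return 1
  done
}
```

Something like
split_ntriples scope.nt 100000 chunks && upload_chunks chunks "$SINK_ENDPOINT_URL" https://some-named-graph/graph/ais-metadata
would then send the file in ~100k-triple batches.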
> If that space is squeezed by Java growing the heap, it can become slow.
ok will check the setup.
> TDB2 - there's a reason why it is not TDB1 :-)
That is very good to know. So far we have mainly used it in tdbloader
setups, so apparently the issues with TDB1 were less of a problem for our
use cases.
thanks for the feedback so far!
regards
Adrian