On 25/01/2019 20:05, Amit Kumar wrote:
My team has a big knowledge graph that we want to serve via a SPARQL
endpoint. We are looking into using Apache Fuseki for this. I have some
questions and was hoping someone here can guide me.
Right now, I'm working on a dataset which consists of 175 million triples,
which translates to a TDB2 database of around 250GB when loaded with
tdb2.tdbloader.
The entire knowledge db is regenerated once a day and, as per our rough
count, approximately 14 million triples (1.6GB uncompressed, ~8% of the
data) change every day, including additions and deletions.
What is the best way to update a live Fuseki dataset when you have to
change such a large number of triples?
The method you describe below with INSERT DATA / DELETE DATA is fine for
TDB2. It is specially handled by the parser and execution engine, and the
actions are executed streaming straight onto the database.
There is another way that might be useful to you.
We have tried doing something like this
curl -X POST -d @update.txt --header "Content-type: application/sparql-update"
-v http://localhost:9999/my/update
--data-binary is a bit better, e.g.:

curl -X POST --data-binary @update.txt \
     --header "Content-type: application/sparql-update" \
     -v http://localhost:9999/my/update

(-d can do some processing on the file, such as stripping newlines.)
Where update.txt file looks something like
DELETE DATA {
<sub1> <pred1> <obj1> .
<sub2> <pred2> <obj2> .
...
};
INSERT DATA {
<sub1> <pred1> <obj11> .
<sub2> <pred2> <obj22> .
....
}
Good.
DELETE DATA and INSERT DATA are the way to go.
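As an illustration, the daily diff could be assembled into a single request
body shaped like the update.txt above. This is only a sketch - the helper
name and the idea of keeping the changes as pre-formatted N-Triples lines
are my assumptions, not part of Fuseki:

```python
def build_update(removed, added):
    """Assemble one SPARQL Update request body from the day's diff.

    `removed` and `added` are iterables of ready-made N-Triples lines
    such as '<sub1> <pred1> <obj1> .' (hypothetical helper, for
    illustration only).
    """
    parts = []
    if removed:
        parts.append("DELETE DATA {\n" + "\n".join(removed) + "\n} ;")
    if added:
        parts.append("INSERT DATA {\n" + "\n".join(added) + "\n}")
    return "\n".join(parts)

body = build_update(
    removed=["<sub1> <pred1> <obj1> ."],
    added=["<sub1> <pred1> <obj11> ."],
)
```

Since both operations travel in one HTTP request, they execute as a single
update.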
It takes around 15-20 minutes on our beefy machine. I had some questions
regarding this approach
- Does making a curl request like this wrap the entire call within a
transaction?
Yes.
- Is there a size limit on how big a call I can make ?
No, not on the Fuseki side.
- My understanding is that the Fuseki server will have to download the
full file on its side and then apply the changes? Is it correct ?
No - not in the setup using TDB2 - it should stream the update, parsing
and applying the changes as the request arrives rather than buffering the
whole file first.
Also,
will it affect any ongoing read requests running in parallel?
No.
One writer (W) and any number of reader (R) transactions happen in true
parallel.
- Is there any other better way to update the db?
For bulk changes, it is better to send the data in this way than to send
complex DELETE/INSERT/WHERE updates.
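For contrast, a pattern-based update such as the following (illustrative
only) has to evaluate the WHERE clause against the dataset before any
change can be made, so it cannot be streamed the way ground DELETE DATA /
INSERT DATA triples can:

```
DELETE { ?s <pred1> ?o }
INSERT { ?s <pred1> <obj11> }
WHERE  { ?s <pred1> ?o }
```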
-----------------
An alternative is provided by RDF Delta.
https://afs.github.io/rdf-delta/
It is open source (Apache license) and, if the PMC accepts it, will
migrate to Jena. <disclosure: I'm "afs" on github>
This is a patch format which is similar to the DELETE DATA/INSERT DATA
approach, except it can be generated as a stream of changes as they happen
and it handles blank nodes.
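For a flavour of the format: a patch covering one of the changed triples
from the earlier example might look roughly like this (TX/TC mark the
transaction boundaries, D deletes a triple, A adds one - see the RDF Delta
documentation for the exact syntax):

```
TX .
D <sub1> <pred1> <obj1> .
A <sub1> <pred1> <obj11> .
TC .
```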
There is a Fuseki server with a built-in patch handler:
http://central.maven.org/maven2/org/seaborne/rdf-delta/rdf-delta-dist/
Sending updates to a live server is one use case (we have that in
production with a customer as part of my day-job).
From that, it can be used to keep several servers in-step (high
availability) from one used for updates (actually, in the general case
it can be a cluster of Fuseki behind a load balancer and any server can
receive and execute an update).
This setup is also in deployment in a different customer's cloud
infrastructure.
Just ask if you want to know more.
Andy
Thanks for your help.
Regards
Amit