On 25/01/2019 20:05, Amit Kumar wrote:
My team has a big knowledge graph that we want to serve via a SPARQL
endpoint. We are looking into using Apache Fuseki for this. I have some
questions and was hoping someone here could guide me.

Right now, I'm working on a dataset of 175 million triples, which
translates to around a 250GB TDB2 database when loaded with
tdb2.tdbloader.

The entire knowledge base is regenerated once a day, and by our rough count
approximately 14 million triples (1.6GB uncompressed, ~8%) change every day,
including additions and deletions.

What is the best way to update a live Fuseki dataset when you have to
change this many triples?

The method you describe below with INSERT DATA / DELETE DATA is fine for TDB2. It is specially handled by the parser and execution engine, and the actions are executed streaming, straight onto the database.

There is another way that might be useful to you.


We have tried doing something like this

curl -X POST -d @update.txt \
     --header "Content-type: application/sparql-update" \
     -v http://localhost:9999/my/update

--data-binary is a bit better.

(-d can do some processing on the file: it strips newlines and carriage returns.)
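For example, the same request with --data-binary (the endpoint URL is the one from your command; adjust for your deployment):

```shell
curl -X POST --data-binary @update.txt \
     --header "Content-type: application/sparql-update" \
     http://localhost:9999/my/update
```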



Where the update.txt file looks something like:

DELETE DATA {
<sub1> <pred1> <obj1> .
<sub2> <pred2> <obj2> .
...
};
INSERT DATA {
<sub1> <pred1> <obj11> .
<sub2> <pred2> <obj22> .
....
}

Good.

DELETE DATA and INSERT DATA are the way to go.

It takes around 15-20 minutes on our beefy machine. I had some questions
regarding this approach:

    - Does making a curl request like this wrap the entire call within a
    transaction?

Yes.

    - Is there a size limit on how big a call I can make?

No, not on the Fuseki side.

    - My understanding is that the Fuseki server will have to download the
    full file on its side and then apply the changes. Is that correct?

No - not in the setup using TDB2 - it should stream the update, applying the changes as the request body is parsed rather than buffering the whole file first.

Also,
    will it affect any ongoing read requests running in parallel?

No.

One writer (W) and any number of reader (R) transactions happen in true parallel.

    - Is there any better way to update the db?

For bulk changes, it is better to send the data this way rather than as complex DELETE/INSERT/WHERE requests.
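As a sketch of that workflow: if the daily build emits two N-Triples diff files (the names removed.nt and added.nt are hypothetical - here tiny samples are fabricated in place), a single update request can be assembled like this. Note that DELETE DATA / INSERT DATA take only ground triples (no variables; DELETE DATA also disallows blank nodes), so the diffs must be concrete triples:

```shell
# Fabricate tiny sample diff files standing in for the daily build output.
printf '<urn:ex:s1> <urn:ex:p> "old" .\n' > removed.nt
printf '<urn:ex:s1> <urn:ex:p> "new" .\n' > added.nt

# Wrap the diffs in one SPARQL Update request: deletions first, then insertions.
{
  printf 'DELETE DATA {\n'
  cat removed.nt
  printf '} ;\nINSERT DATA {\n'
  cat added.nt
  printf '}\n'
} > update.ru

# Then POST it in a single transaction (endpoint URL as in your setup):
# curl --data-binary @update.ru \
#      --header "Content-type: application/sparql-update" \
#      http://localhost:9999/my/update
```

Because both operations are in one request, they execute in one transaction, so readers never see the intermediate state with the old triples removed but the new ones not yet added.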

-----------------

An alternative is provided by RDF Delta.
https://afs.github.io/rdf-delta/

It is open source under the Apache license and, if the PMC accepts it, will migrate to Jena. <disclosure: I'm "afs" on GitHub>

This is a patch format similar to DELETE DATA/INSERT DATA, except that it can be generated as a stream of changes as they happen, and it handles blank nodes.
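From memory of the RDF Delta documentation (see the site above for the authoritative grammar), a patch is a line-oriented list of rows: TX and TC bracket a transaction, A adds a triple or quad, D deletes one. An illustrative fragment, with example.org-style URIs made up for the sketch:

```
TX .
A <http://example/s> <http://example/p> "new value" .
D <http://example/s> <http://example/p> "old value" .
TC .
```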

There is a Fuseki server with a built-in patch handler:
http://central.maven.org/maven2/org/seaborne/rdf-delta/rdf-delta-dist/

Sending updates to a live server is one use case (we have this running in production with a customer as part of my day-job).

From that, it can be used to keep several servers in step (high availability), fed from one server used for updates (in the general case, a cluster of Fuseki servers can sit behind a load balancer, and any server can receive and execute an update).

This setup is also in deployment in a different customer's cloud infrastructure.

Just ask if you want to know more.

    Andy


Thanks for your help.

Regards
Amit
