JENA-528

== History

The BulkUpdateHandler interface is supposed to help systems optimize larger-scale addition and deletion of triples (it is graph-centric, so datasets don't feature here; it predates datasets).

The API contract is that a collection of triples can be added or deleted in a single operation. There is no possibility of looking at the data while a bulk operation is in progress (it is a single API call). The default implementation, SimpleBulkUpdateHandler, turns these operations into a loop.

This contract has problems: how big is a batch? Does it have to reside in memory? (The nearest approach to avoiding that is pull-style passing of an Iterator, but that is hard to use.)
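To make the contract concrete, here is a minimal sketch of the shape of the interface and its loop-based default. The types below are simplified stand-ins, not the real Jena classes (the real BulkUpdateHandler has more overloads, e.g. arrays and graphs):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Main {
    record Triple(String s, String p, String o) {}

    // Simplified stand-in for the BulkUpdateHandler contract.
    interface BulkUpdate {
        void add(List<Triple> triples);      // whole batch materialized in memory
        void add(Iterator<Triple> triples);  // pull-style: avoids memory, harder to use
    }

    // The default behaviour described above: the "bulk" call is just a loop,
    // so a backend gains nothing unless it overrides these methods.
    static class SimpleHandler implements BulkUpdate {
        final List<Triple> store = new ArrayList<>();

        void addOne(Triple t) { store.add(t); }

        public void add(List<Triple> triples) {
            for (Triple t : triples) addOne(t);
        }

        public void add(Iterator<Triple> triples) {
            while (triples.hasNext()) addOne(triples.next());
        }
    }

    public static void main(String[] args) {
        SimpleHandler h = new SimpleHandler();
        h.add(List.of(new Triple("s1", "p", "o"), new Triple("s2", "p", "o")));
        System.out.println("stored=" + h.store.size());
    }
}
```

The point of the sketch is that nothing in the contract itself says where a batch ends or how large it may grow; that is exactly the gap the questions above identify.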

ARP (the RDF/XML parser) did use batching: it builds an array of 1000 triples and sends it into the model, batch by batch. Other parsers did not batch, though they can signal the start and end of a parse run via a different mechanism.

= RDB

The original work was triggered by RDB.

The gain came from using a single DB transaction for multiple additions. Without any other information, the current state of the JDBC connection would be used, and that could be autocommit, which has severe limitations.

= SDB

SDB does have a bulk loader: it uses temporary tables in the database to accumulate and manipulate the data to insert (e.g. de-duplication). It also avoids the JDBC/autocommit issue.

The temporary tables are not always transaction-safe (this depends on the DB). SDB even copes with bulk deletion (though I'm unclear about what happens if a triple is inserted and deleted in the same batch, because inserts and deletes are treated as two separate groups, so interleaving is lost).

A batch is ideally in the 10k-20k range. Batches accumulate; this is not a single API operation.

Batching is done internally (even if the client app batches as well). It is triggered by explicit SDB-specific calls or by the graph events GraphEvents.startRead and GraphEvents.finishRead. These events can be raised manually by the application; they cause the appropriate GraphSDB operations to be called, and those operations can also be called explicitly.
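The batching behaviour described above can be sketched as follows. This is a simplified model with in-memory lists, not SDB's actual temporary-table mechanism, and the start/finish methods stand in for the real event notifications:

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    static final int BATCH_SIZE = 10_000;  // SDB's sweet spot is roughly 10k-20k

    static class BatchingGraph {
        final List<String> committed = new ArrayList<>();
        final List<String> buffer = new ArrayList<>();
        boolean bulkMode = false;
        int flushes = 0;

        void startRead()  { bulkMode = true; }            // begin accumulating
        void finishRead() { flush(); bulkMode = false; }  // final flush at end of batch

        void add(String triple) {
            if (!bulkMode) { committed.add(triple); return; }
            buffer.add(triple);
            if (buffer.size() >= BATCH_SIZE) flush();     // batches accumulate
        }

        // Buffered triples are invisible until flushed, mirroring
        // "batched additions can't be seen by Graph.find during the batch".
        boolean find(String triple) { return committed.contains(triple); }

        void flush() {
            committed.addAll(buffer);  // in SDB: insert from the temp table
            buffer.clear();
            flushes++;
        }
    }

    public static void main(String[] args) {
        BatchingGraph g = new BatchingGraph();
        g.startRead();
        for (int i = 0; i < 5; i++) g.add("t" + i);
        System.out.println("mid-batch-visible=" + g.find("t0"));  // still buffered
        for (int i = 5; i < 25_000; i++) g.add("t" + i);
        g.finishRead();
        System.out.println("after-finish-visible=" + g.find("t0")
                + " flushes=" + g.flushes);
    }
}
```

With 25,000 additions and a batch size of 10,000, the sketch flushes twice mid-stream and once at finishRead, which is the "batches accumulate" behaviour rather than one giant operation.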

Batched additions can't be seen by Graph.find during the batch update.

http://jena.apache.org/documentation/sdb/loading_data.html

The bulk triple API operations were overlaid on this mechanism.

= TDB

TDB has a bulk loader; it loads empty databases. Its idea of a batch is the whole data stream to be loaded: millions of triples, beyond what can be buffered in RAM. It does not need to know the end of a batch when it starts one.

Bulk upload is not transactional. Typically, it's a separate step using one of the bulk loader utilities. It handles triples and quads. It does not use the BulkUpdateHandler.

For TDB, batches into a non-empty database are not special. They could be, and that might be advantageous in some situations, but currently TDB does nothing special in this case. In the future, Lizard (which is derived from TDB) would like to see batches for insertion; it could make use of bulk insertion even on non-empty databases.

TDB does nothing special about deletes (though it does have, separately, an optimized path for Graph.remove and Graph.clear).

When active, the bulk loader assumes total control over the database; any other operation (e.g. looking at the data) is likely to go wrong (very, very wrong!), because the loader manipulates database details at a very low level.

For hundreds of millions of triples, bulk loading is the only way to go.

== Towards Requirements

So we have related-but-different mechanisms in different places.

* Is bulk deletion an issue worth addressing?

Do any other systems have bulk optimizations for deletion (other than "delete all")?

* What about a mixture of adds/deletes?

* What is the contract, e.g. for parallel uses of Graph.find?

* What's the unit? Graphs, datasets, other?

Separately, there is a graph operation Graph.remove(S,P,O) where S, P, and O can be Node.ANY, so it is a pattern.
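A minimal sketch of that pattern-removal semantics, using a plain list of triples and null as a stand-in for Node.ANY (the real operation works against a Graph, of course):

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    record Triple(String s, String p, String o) {}

    // null plays the role of Node.ANY: it matches any value in that slot.
    static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }

    // Remove every triple matching the (S, P, O) pattern, cf. Graph.remove(S,P,O).
    static void remove(List<Triple> graph, String s, String p, String o) {
        graph.removeIf(t -> matches(s, t.s()) && matches(p, t.p()) && matches(o, t.o()));
    }

    public static void main(String[] args) {
        List<Triple> g = new ArrayList<>(List.of(
                new Triple("alice", "knows", "bob"),
                new Triple("alice", "age", "42"),
                new Triple("bob", "knows", "carol")));
        remove(g, "alice", null, null);   // remove(alice, ANY, ANY)
        System.out.println("remaining=" + g.size());
    }
}
```

This is the operation a backend could optimize as a single pattern-scoped delete rather than a find-then-delete loop, which is why it is worth mentioning alongside bulk deletion.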

        Andy
