JENA-528

== History

The BulkUpdateHandler interface is supposed to help systems optimize larger-scale addition and deletion of triples (it is graph-centric, so datasets don't feature here; it predates datasets).

The API contract is that a collection of triples can be added or deleted in a single operation. There is no possibility of looking at the data while a bulk operation is in progress (it is a single API call). The default implementation, SimpleBulkUpdateHandler, turns these operations into a loop.

This contract has problems: how big is a batch? Does it have to reside in memory? (The nearest approach to avoiding that is pull-style passing of an Iterator, but that is hard to use.)
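To make the contract concrete, here is a minimal sketch of the shape of the interface and its loop-based default. The types below are simplified stand-ins, not the real Jena classes (the real BulkUpdateHandler has more overloads, e.g. arrays and graphs):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Main {
    record Triple(String s, String p, String o) {}

    // Simplified stand-in for the BulkUpdateHandler contract.
    interface BulkUpdate {
        void add(List<Triple> triples);      // whole batch materialized in memory
        void add(Iterator<Triple> triples);  // pull-style: avoids memory, harder to use
    }

    // The default behaviour described above: the "bulk" call is just a loop,
    // so a backend gains nothing unless it overrides these methods.
    static class SimpleHandler implements BulkUpdate {
        final List<Triple> store = new ArrayList<>();

        void addOne(Triple t) { store.add(t); }

        public void add(List<Triple> triples) {
            for (Triple t : triples) addOne(t);
        }

        public void add(Iterator<Triple> triples) {
            while (triples.hasNext()) addOne(triples.next());
        }
    }

    public static void main(String[] args) {
        SimpleHandler h = new SimpleHandler();
        h.add(List.of(new Triple("s1", "p", "o"), new Triple("s2", "p", "o")));
        System.out.println("stored=" + h.store.size());
    }
}
```

The point of the sketch is that nothing in the contract itself says where a batch ends or how large it may grow; that is exactly the gap the questions above identify.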

ARP (the RDF/XML parser) did use batching: it builds an array of 1000 triples and sends it into the model, batch by batch. Other parsers did not batch, though they can signal the start and end of a parse run via a different mechanism.

= RDB

The original work was triggered by RDB.

The gain came from using a single DB transaction for multiple additions. Without any other information, the current state of the JDBC connection would be used, and that could be autocommit, which has severe limitations.

= SDB

SDB does have a bulk loader: it uses temporary tables in the database to accumulate and manipulate the data to insert (e.g. de-duplication). It also avoids the JDBC/autocommit issue.

The temporary tables are not always transaction-safe (this depends on the DB). SDB even copes with bulk deletion (though I'm unclear about what happens if a triple is inserted and deleted in the same batch, because inserts and deletes are treated as two separate groups, so interleaving is lost).

A batch is ideally in the 10k-20k range. Batches accumulate; this is not a single API operation.

Batching is done internally (even if the client app batches as well). It is triggered by explicit SDB-specific calls or by the graph events GraphEvents.startRead and GraphEvents.finishRead. These events can be raised manually by the application; they cause the appropriate GraphSDB operations to be called, and those operations can also be called explicitly.
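The batching behaviour described above can be sketched as follows. This is a simplified model with in-memory lists, not SDB's actual temporary-table mechanism, and the start/finish methods stand in for the real event notifications:

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    static final int BATCH_SIZE = 10_000;  // SDB's sweet spot is roughly 10k-20k

    static class BatchingGraph {
        final List<String> committed = new ArrayList<>();
        final List<String> buffer = new ArrayList<>();
        boolean bulkMode = false;
        int flushes = 0;

        void startRead()  { bulkMode = true; }            // begin accumulating
        void finishRead() { flush(); bulkMode = false; }  // final flush at end of batch

        void add(String triple) {
            if (!bulkMode) { committed.add(triple); return; }
            buffer.add(triple);
            if (buffer.size() >= BATCH_SIZE) flush();     // batches accumulate
        }

        // Buffered triples are invisible until flushed, mirroring
        // "batched additions can't be seen by Graph.find during the batch".
        boolean find(String triple) { return committed.contains(triple); }

        void flush() {
            committed.addAll(buffer);  // in SDB: insert from the temp table
            buffer.clear();
            flushes++;
        }
    }

    public static void main(String[] args) {
        BatchingGraph g = new BatchingGraph();
        g.startRead();
        for (int i = 0; i < 5; i++) g.add("t" + i);
        System.out.println("mid-batch-visible=" + g.find("t0"));  // still buffered
        for (int i = 5; i < 25_000; i++) g.add("t" + i);
        g.finishRead();
        System.out.println("after-finish-visible=" + g.find("t0")
                + " flushes=" + g.flushes);
    }
}
```

With 25,000 additions and a batch size of 10,000, the sketch flushes twice mid-stream and once at finishRead, which is the "batches accumulate" behaviour rather than one giant operation.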

Batched additions can't be seen by Graph.find during the batch update.

http://jena.apache.org/documentation/sdb/loading_data.html

The bulk triple API operations were overlaid on this mechanism.

= TDB

TDB has a bulk loader; it loads empty databases. Its idea of a batch is the whole data stream to be loaded: millions of triples, beyond what can be buffered in RAM. It does not need to know the end of a batch when it starts one.

Bulk upload is not transactional. Typically, it's a separate step using one of the bulk loader utilities. It handles triples and quads. It does not use the BulkUpdateHandler.

For TDB, batches into a non-empty database are not special. They could be, and that might be advantageous in some situations, but currently TDB does nothing special in this case. In the future, Lizard (which is derived from TDB) would like to see batches for insertion; it could make use of bulk insertion even on non-empty databases.

TDB does nothing special about deletes (though it does have, separately, an optimized path for Graph.remove and Graph.clear).

When active, the bulk loader assumes total control over the database; any other operation (e.g. looking at the data) is likely to go wrong (very, very wrong!), because the loader manipulates database details at a very low level.

For hundreds of millions of triples, bulk loading is the only way to go.

== Towards Requirements

So we have related-but-different mechanisms in different places.

* Is bulk deletion an issue worth addressing?

Do any other systems have bulk optimizations for deletion (other than "delete all")?

* What about a mixture of adds/deletes?

* What is the contract, e.g. for parallel uses of Graph.find?

* What's the unit? Graphs, datasets, other?

Separately, there is a graph operation Graph.remove(S,P,O) where S, P, and O can be Node.ANY, so it is a pattern.
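A minimal sketch of that pattern-removal semantics, using a plain list of triples and null as a stand-in for Node.ANY (the real operation works against a Graph, of course):

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    record Triple(String s, String p, String o) {}

    // null plays the role of Node.ANY: it matches any value in that slot.
    static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }

    // Remove every triple matching the (S, P, O) pattern, cf. Graph.remove(S,P,O).
    static void remove(List<Triple> graph, String s, String p, String o) {
        graph.removeIf(t -> matches(s, t.s()) && matches(p, t.p()) && matches(o, t.o()));
    }

    public static void main(String[] args) {
        List<Triple> g = new ArrayList<>(List.of(
                new Triple("alice", "knows", "bob"),
                new Triple("alice", "age", "42"),
                new Triple("bob", "knows", "carol")));
        remove(g, "alice", null, null);   // remove(alice, ANY, ANY)
        System.out.println("remaining=" + g.size());
    }
}
```

This is the operation a backend could optimize as a single pattern-scoped delete rather than a find-then-delete loop, which is why it is worth mentioning alongside bulk deletion.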

        Andy
