Hi Stephen,

On 16/08/13 20:20, Stephen Allen wrote:
On Thu, Aug 15, 2013 at 8:47 AM, Andy Seaborne <[email protected]>
wrote:
...
There is supposed to be a specific implementation of deleteAny which is
like GraphTDB.removeWorker.  But there isn't.  Actually, I don't
see why GraphTDB.removeWorker needs to exist if a proper
DatasetGraphTDB.deleteAny existed.

Recorded as JENA-513.

I'll sort this out by moving GraphTDB.removeWorker to
DatasetGraphTDB and using it for deleteAny(...) and from
GraphTDB.remove.

The GraphTDB.removeWorker code gets batches of 1000 items, deletes
them and tries again until there is nothing more matching the
delete pattern.  Deletes are not done by iterator.
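The batch-and-retry pattern can be sketched as below. This is a stdlib-only illustration, not the actual GraphTDB code: the collection stands in for the tuple index, and prefix matching stands in for the delete pattern. The point is that deletion never happens under a live iterator; each pass collects up to a batch of matches, closes the scan, deletes, and re-scans until nothing matches.

```java
import java.util.*;

public class BatchDelete {
    static final int BATCH_SIZE = 1000;

    // Repeatedly collect up to BATCH_SIZE matches, delete them, and loop
    // until a pass finds nothing more matching the pattern.
    static int deleteAny(Collection<String> store, String prefix) {
        int total = 0;
        while (true) {
            List<String> batch = new ArrayList<>();
            for (String item : store) {               // find(pattern)
                if (item.startsWith(prefix)) {
                    batch.add(item);
                    if (batch.size() == BATCH_SIZE)
                        break;                        // end the scan early
                }
            }
            if (batch.isEmpty())
                return total;                         // nothing left matching
            store.removeAll(batch);                   // delete outside the iterator
            total += batch.size();
        }
    }

    public static void main(String[] args) {
        Collection<String> store = new HashSet<>(Arrays.asList("a1", "a2", "b1"));
        System.out.println(deleteAny(store, "a"));    // -> 2
        System.out.println(store);                    // -> [b1]
    }
}
```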


So as an alternative, you can use SPARQL Update combined with
setting the ARQ.spillToDiskThreshold parameter to a reasonable value
(10,000 maybe?).  This will enable stream-to-disk functionality for
the intermediate bindings for DELETE/INSERT/WHERE queries (as well
as several of the SPARQL operators in the WHERE clause, see
JENA-119). This should eliminate memory bounds for the most part
except for the TDB's BlockMgrJournal.
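Conceptually the threshold works like this (a stdlib-only sketch of the idea, not ARQ's DataBag code; the class and method names here are made up): bindings are buffered in memory until the threshold is crossed, after which everything goes to a temporary file.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch of spill-to-disk buffering: hold items in memory
// until a threshold, then move everything to a temp file and keep
// appending there.
public class SpillBuffer implements Closeable {
    private final int threshold;
    private final List<String> memory = new ArrayList<>();
    private Path spillFile;            // null until we overflow
    private BufferedWriter writer;

    public SpillBuffer(int threshold) { this.threshold = threshold; }

    public void add(String binding) throws IOException {
        if (writer != null) { writer.write(binding); writer.newLine(); return; }
        memory.add(binding);
        if (memory.size() > threshold) {              // overflow: move to disk
            spillFile = Files.createTempFile("spill", ".tmp");
            writer = Files.newBufferedWriter(spillFile);
            for (String b : memory) { writer.write(b); writer.newLine(); }
            memory.clear();
        }
    }

    // All writes must finish before any reading -- the same limitation
    // the DataBag classes have.
    public List<String> readAll() throws IOException {
        if (writer == null) return memory;
        writer.flush();
        return Files.readAllLines(spillFile);
    }

    public void close() throws IOException {
        if (writer != null) { writer.close(); Files.deleteIfExists(spillFile); }
    }

    public static void main(String[] args) throws IOException {
        try (SpillBuffer sb = new SpillBuffer(2)) {
            for (String b : new String[] { "a", "b", "c", "d" })
                sb.add(b);
            System.out.println(sb.readAll());         // -> [a, b, c, d]
        }
    }
}
```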

JENA-513 is done, so do have a look at the code if you're interested.  Do
you see an advantage of spill-to-disk?  Doesn't it have to materialize the triples?

I would have thought the removeWorker loop should be more efficient
because it works in NodeId space and does not touch the NodeTable.  The
repeated use of find() is efficient because B+Tree branch blocks are
in-memory with very high probability. And there are no temporary files to worry about.

If there were a spill cache for BlockMgrJournal that would be a
great thing to have.  It's a much more direct way to get scalable
transactions and works without a DB format change.


Agreed.  Unfortunately the *DataBag classes require all data to be
written before any reading occurs, which makes them inappropriate.
Can't we just use another disk-backed B+Tree as a temporary store
here instead of the in-memory HashMap?

If we used some other B+Tree implementation, then maybe.

FYI: This is new ...
http://directory.apache.org/mavibot/vision.html

though it's solving a different B+Tree issue.

The requirement is a persistent hash map from block id (8 bytes) to block bytes (up to 8K of data). This is happening underneath the index B+Trees during a transaction.

TDB B+Trees have fixed-size keys and values and are not designed for storing 8K blocks (there is no large-value support: the trees themselves use 8K blocks, so an 8K payload can't fit once the few bytes of per-record overhead are added). They are optimized for range scans.

There is an external hash table in TDB as well but, again, it is not designed for storing 8K units.

In fact, a regular filesystem directory with one file per 8K spilled block would be a good mock-up. That would exploit the OS file cache, there is no need to sync the files, and it is quite easy to build and debug.
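Such a mock-up could be as small as this (a stdlib sketch of the idea, not proposed TDB code): one file per block, named by block id, no fsync, since the journal proper remains the durable copy.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical file-per-block spill store: one file per 8K block in a
// temp directory, named by block id. Never synced -- losing these files
// on a crash is fine because the journal is the durable copy.
public class FileBlockStore {
    private final Path dir;

    public FileBlockStore() throws IOException {
        dir = Files.createTempDirectory("spill-blocks");
    }

    public void put(long blockId, byte[] block) throws IOException {
        // Overwrites any earlier spill of the same block.
        Files.write(dir.resolve("blk-" + blockId), block);
    }

    public byte[] get(long blockId) throws IOException {
        return Files.readAllBytes(dir.resolve("blk-" + blockId));
    }

    public boolean contains(long blockId) {
        return Files.exists(dir.resolve("blk-" + blockId));
    }

    public static void main(String[] args) throws IOException {
        FileBlockStore store = new FileBlockStore();
        byte[] block = new byte[8192];
        block[0] = 42;
        store.put(7L, block);
        System.out.println(store.get(7L)[0]);   // -> 42
    }
}
```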

An advantage of early spill-to-journal with an in-memory tombstone (~10 bytes each, or a B+Tree of tombstones) is that this can be the final write of the data to the journal. An off-journal temporary store means the block has to be written to the off-journal store, then read back and written to the journal. That's extra I/O, and probably real I/O with disk-head movement (read from one place, write to another).
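The bookkeeping the tombstone scheme needs is tiny. A stdlib sketch of the idea (not existing TDB code; the journal is simulated with an in-memory byte stream): the block bytes are appended straight to the journal in a single write, and the heap keeps only a small blockId-to-offset record per spilled block instead of the full 8K of data.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: blocks go straight to the (append-only) journal,
// and the in-memory side keeps only blockId -> {offset, length} -- a few
// bytes per block rather than the 8K block itself.
public class TombstoneIndex {
    private final ByteArrayOutputStream journal = new ByteArrayOutputStream();
    private final Map<Long, int[]> tombstones = new HashMap<>();

    public void spill(long blockId, byte[] block) {
        tombstones.put(blockId, new int[] { journal.size(), block.length });
        journal.write(block, 0, block.length);   // single write: this IS the journal copy
    }

    public byte[] read(long blockId) {
        int[] loc = tombstones.get(blockId);
        return Arrays.copyOfRange(journal.toByteArray(), loc[0], loc[0] + loc[1]);
    }

    public int heapEntries() { return tombstones.size(); }

    public static void main(String[] args) {
        TombstoneIndex idx = new TombstoneIndex();
        idx.spill(1L, new byte[] { 9 });
        idx.spill(2L, new byte[] { 8, 7 });
        System.out.println(idx.read(2L)[1]);     // -> 7
        System.out.println(idx.heapEntries());   // -> 2
    }
}
```

Re-spilling a block simply points its tombstone at a newer journal offset, which matches the append-only behaviour discussed next.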

The only oddity is that the journal is append-only (it does not have to be within the current uncommitted transaction, but append-only files are faster than random-access files). If a block gets spilled, is then updated, and is spilled again, we assume that is sufficiently rare that appending a second, newer copy to the journal, which overwrites the first on playback, is acceptable. Not perfect, but write-in-place into the active journal area would probably cost more (it's more likely to need a disk seek).
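On playback, "newer copy wins" is just last-write-wins over the append order. A stdlib sketch of that replay rule (not the BlockMgrJournal code):

```java
import java.util.*;

// Hypothetical sketch of journal playback: entries are replayed in append
// order, so a block spilled twice has its later copy overwrite the earlier
// one in the result -- no in-place update of the journal file is needed.
public class JournalReplay {
    public static Map<Long, byte[]> replay(List<Map.Entry<Long, byte[]>> entries) {
        Map<Long, byte[]> blocks = new LinkedHashMap<>();
        for (Map.Entry<Long, byte[]> e : entries)
            blocks.put(e.getKey(), e.getValue());   // later entry wins
        return blocks;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, byte[]>> entries = new ArrayList<>();
        entries.add(new AbstractMap.SimpleEntry<>(1L, new byte[] { 1 }));  // first spill
        entries.add(new AbstractMap.SimpleEntry<>(1L, new byte[] { 2 }));  // re-spill
        System.out.println(replay(entries).get(1L)[0]);   // -> 2
    }
}
```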

(Yes, the journal on an SSD is a good idea.)

I've actually been running into this issue because now that
streaming SPARQL Update support is available, I find I am generating
and streaming so much data in a single transaction that I need to
devote a not-insignificant amount of heap just for storing the
pending blocks.

        Andy
