On Fri, Aug 16, 2013 at 4:17 PM, Andy Seaborne <[email protected]> wrote:
> Hi Stephen,
>
>
> On 16/08/13 20:20, Stephen Allen wrote:
>>
>> On Thu, Aug 15, 2013 at 8:47 AM, Andy Seaborne <[email protected]>
>> wrote:
>
> ...
>
>>> There is supposed to be a specific implementation of deleteAny which
>>> is like GraphTDB.removeWorker.  But there isn't.  Actually, I don't
>>> see why GraphTDB.removeWorker needs to exist if a proper
>>> DatasetGraphTDB.deleteAny existed.
>>>
>>> Recorded as JENA-513.
>>>
>>> I'll sort this out by moving GraphTDB.removeWorker to
>>> DatasetGraphTDB and using it for deleteAny(...) and from
>>> GraphTDB.remove.
>>>
>>> The GraphTDB.removeWorker code gets batches of 1000 items, deletes
>>> them and tries again until there is nothing more matching the
>>> delete pattern.  Deletes are not done by iterator.
>>>
>>
>> So as an alternative, you can use SPARQL Update combined with
>> setting the ARQ.spillToDiskThreshold parameter to a reasonable value
>> (10,000 maybe?).  This will enable stream-to-disk functionality for
>> the intermediate bindings for DELETE/INSERT/WHERE queries (as well
>> as several of the SPARQL operators in the WHERE clause, see
>> JENA-119).  This should largely eliminate the memory bound, except
>> for TDB's BlockMgrJournal.
>
>
> JENA-513 is done, so do have a look at the code if you're interested.  Do
> you see an advantage of spill-to-disk?  Doesn't it have to materialize the
> triples?
>
> I would have thought the removeWorker loop should be more efficient
> because it works in NodeId space and does not touch the NodeTable.  The
> repeated use of find() is efficient because B+Tree branch blocks are
> in-memory with very high probability. And there are no temporary files to
> worry about.
>

Yeah, there is no real benefit to doing it the SPARQL Update way; I
just wanted to point it out.  But if he is running in-process, then
removeWorker is much better.
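
For reference, the SPARQL Update route I was describing looks roughly
like this (only a sketch and untested - the database location, the
10,000 threshold and the delete pattern are placeholders, and it
assumes the current 2.x package names):

    import com.hp.hpl.jena.query.ARQ ;
    import com.hp.hpl.jena.query.Dataset ;
    import com.hp.hpl.jena.tdb.TDBFactory ;
    import com.hp.hpl.jena.update.UpdateAction ;

    public class DeleteBySparqlUpdate {
        public static void main(String[] args) {
            Dataset ds = TDBFactory.createDataset("/path/to/DB") ;

            // Spill intermediate bindings to disk once they pass ~10,000 rows.
            ARQ.getContext().set(ARQ.spillToDiskThreshold, 10000L) ;

            // The WHERE pattern is evaluated with disk-backed bindings, so
            // the delete is not limited by heap size (BlockMgrJournal aside).
            UpdateAction.parseExecute(
                "DELETE { ?s ?p ?o } WHERE { ?s ?p ?o }", ds) ;

            ds.close() ;
        }
    }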

>
>>> If there were a spill cache for BlockMgrJournal, that would be a
>>> great thing to have.  It's a much more direct way to get scalable
>>> transactions, and it works without a DB format change.
>>>
>>
>> Agreed.  Unfortunately the *DataBag classes require all data to be
>> written before any reading occurs, which makes them inappropriate.
>> Can't we just use another disk-backed B+Tree as a temporary store
>> here instead of the in-memory HashMap?
>
>
> If we used some other B+Tree implementation, then maybe.
>
> FYI: This is new ...
> http://directory.apache.org/mavibot/vision.html
>
> It is solving a different B+Tree issue, though.
>
> The requirement is a persistent hash map from block id (8 bytes) to block
> bytes (up to 8K of data).  This is happening underneath the index B+Trees
> during a transaction.
>
> TDB B+Trees have fixed-size keys and values and are not designed for
> storing 8K blocks (there is no very-large-value support - they use 8K
> blocks themselves, so they can't store a full 8K block because a few
> bytes go to overhead).  They are optimized for range scans.
>

Ah, of course, the B+Trees can't really store blocks bigger than their
fixed block size.
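
Put another way, what's needed is essentially this contract, backed by
disk and scoped to a transaction (just restating your requirement as
code; the names are made up):

    import java.nio.ByteBuffer ;

    // A spill store for journal blocks: 8-byte block id in, up to one
    // 8K block of bytes out.  Lives only for the duration of a
    // transaction.
    interface BlockSpillStore {
        void put(long blockId, ByteBuffer block) ;   // latest copy wins
        ByteBuffer get(long blockId) ;               // null if never spilled
        void close() ;                               // discard everything
    }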

> There is an external hash table in TDB as well, but again, it is not
> designed for storing 8K units.
>
> In fact, a regular filesystem directory with one file per 8K spilled
> block would be a good mock-up.  That would utilize the OS file cache,
> there is no need to sync the files, and it is quite easy to do and
> debug.

Do we even need to separate it out into files?  How about a single
memory-mapped file with an in-memory index?  It might be faster and
would use fewer resources such as file handles.
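
Something like this is what I'm picturing (only a sketch - the class
and method names are invented, the file has a fixed maximum size, and
error handling, remapping and concurrency are all ignored):

    import java.io.RandomAccessFile ;
    import java.nio.ByteBuffer ;
    import java.nio.MappedByteBuffer ;
    import java.nio.channels.FileChannel ;
    import java.util.HashMap ;
    import java.util.Map ;

    // One memory-mapped temporary file holding spilled 8K blocks, plus
    // an in-memory map from block id to the offset of its copy.
    class MappedBlockSpill {
        private static final int BLOCK_SIZE = 8 * 1024 ;
        private final FileChannel channel ;
        private final MappedByteBuffer region ;
        private final Map<Long, Integer> index = new HashMap<Long, Integer>() ;
        private int nextOffset = 0 ;

        MappedBlockSpill(String filename, int maxBlocks) throws Exception {
            channel = new RandomAccessFile(filename, "rw").getChannel() ;
            region = channel.map(FileChannel.MapMode.READ_WRITE, 0,
                                 (long)maxBlocks * BLOCK_SIZE) ;
        }

        // Spill (or re-spill) a block; an existing slot is overwritten in
        // place, so the newest copy always wins.  Assumes the block is at
        // most BLOCK_SIZE bytes.
        void put(long blockId, ByteBuffer block) {
            Integer offset = index.get(blockId) ;
            if ( offset == null ) {
                offset = nextOffset ;
                nextOffset += BLOCK_SIZE ;
                index.put(blockId, offset) ;
            }
            ByteBuffer dst = region.duplicate() ;
            dst.position(offset) ;
            dst.put(block) ;
        }

        // Read a spilled block back, or null if it was never spilled.
        ByteBuffer get(long blockId) {
            Integer offset = index.get(blockId) ;
            if ( offset == null )
                return null ;
            ByteBuffer src = region.duplicate() ;
            src.position(offset) ;
            src.limit(offset + BLOCK_SIZE) ;
            return src.slice() ;
        }

        // No sync needed - the file is temporary and deleted afterwards.
        void close() throws Exception { channel.close() ; }
    }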


> An advantage of early spill-to-journal, with an in-memory tombstone
> (~10 bytes) per block (or a B+Tree of tombstones), is that this can be
> the final write of the data to the journal.  An off-journal temporary
> store means a block has to be written to the off-journal store, then
> read back in and written to the journal.  That's extra I/O, and
> probably real I/O with disk head movement (read from one place, write
> to another).
>
> The only oddity is that the journal is append-only (it does not have
> to be append-only within the current uncommitted transaction, but
> append-only files are faster than random-access files).  If a block
> gets spilled, is then updated, and then spilled again, we assume that
> is sufficiently rare that appending a second, newer copy to the
> journal, which overwrites the first on playback, is acceptable.  Not
> perfect, but writing in place to the active journal area may cost more
> (it's more likely to need a disk seek).
>
> (Yes, the journal on an SSD is a good idea.)
>

Agreed.  Spilling directly to the journal is much better than some
other temporary store if we can do it.
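
Just to check I follow the bookkeeping: per spilled block we would only
keep something like the entry below in memory (a sketch, not the actual
BlockMgrJournal code; the names are invented):

    import java.util.HashMap ;
    import java.util.Map ;

    // In-memory "tombstones" for blocks already appended to the journal:
    // block id -> offset of the newest copy.  Re-spilling the same block
    // appends another copy and updates the entry; playback applies the
    // journal in order, so the last copy written wins.
    class SpilledBlockTombstones {
        private final Map<Long, Long> spilled = new HashMap<Long, Long>() ;

        // Record that blockId's latest copy now sits at journalOffset.
        void noteSpilled(long blockId, long journalOffset) {
            spilled.put(blockId, journalOffset) ;
        }

        // Already written to the journal in this transaction?
        boolean isSpilled(long blockId) {
            return spilled.containsKey(blockId) ;
        }

        // Offset of the newest copy, if we need to read the block back.
        Long offsetOf(long blockId) {
            return spilled.get(blockId) ;
        }
    }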

-Stephen
