On Fri, Aug 16, 2013 at 4:17 PM, Andy Seaborne <[email protected]> wrote:
> Hi Stephen,
>
>
> On 16/08/13 20:20, Stephen Allen wrote:
>>
>> On Thu, Aug 15, 2013 at 8:47 AM, Andy Seaborne <[email protected]> wrote:
>
> ...
>
>>> There is supposed to be a specific implementation for deleteAny which
>>> is like GraphTDB.removeWorker.  But there isn't.  Actually, I don't
>>> see why GraphTDB.removeWorker needs to exist if a proper
>>> DatasetGraphTDB.deleteAny existed.
>>>
>>> Recorded as JENA-513.
>>>
>>> I'll sort this out by moving GraphTDB.removeWorker to
>>> DatasetGraphTDB and using it for deleteAny(...) and from
>>> GraphTDB.remove.
>>>
>>> The GraphTDB.removeWorker code gets batches of 1000 items, deletes
>>> them and tries again until there is nothing more matching the
>>> delete pattern.  Deletes are not done by iterator.
>>>
>>
>> So as an alternative, you can use SPARQL Update combined with
>> setting the ARQ.spillToDiskThreshold parameter to a reasonable value
>> (10,000 maybe?).  This will enable stream-to-disk functionality for
>> the intermediate bindings of DELETE/INSERT/WHERE queries (as well as
>> for several of the SPARQL operators in the WHERE clause, see
>> JENA-119).  This should eliminate memory bounds for the most part,
>> except for TDB's BlockMgrJournal.
>
>
> JENA-513 is done, so do have a look at the code if you're interested.
> Do you see an advantage of spill-to-disk?  Doesn't it have to
> materialize the triples?
>
> I would have thought the removeWorker loop should be more efficient
> because it works in NodeId space and does not touch the NodeTable.  The
> repeated use of find() is efficient because B+Tree branch blocks are
> in-memory with very high probability.  And there are no temporary files
> to worry about.
>

Yeah, there is no real benefit to doing it the SPARQL Update way; I just
wanted to point it out.  But if he is running in-process, then the
removeWorker path is much better.
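For completeness, the SPARQL Update route I was suggesting is roughly the
sketch below.  It is untested and hedged: the dataset location and the
10,000 threshold are just placeholders, I'm setting the threshold on the
global context for simplicity, and the package names assume the current
2.x releases (com.hp.hpl.jena.*).

    import com.hp.hpl.jena.query.ARQ;
    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.query.ReadWrite;
    import com.hp.hpl.jena.tdb.TDBFactory;
    import com.hp.hpl.jena.update.UpdateAction;

    public class SpillingDelete {
        public static void main(String[] args) {
            // Placeholder location -- point this at a real TDB directory.
            Dataset dataset = TDBFactory.createDataset("/tmp/tdb-example");

            // Spill intermediate bindings to disk once more than ~10,000
            // rows have accumulated (the value is only a guess).
            ARQ.getContext().set(ARQ.spillToDiskThreshold, 10000L);

            dataset.begin(ReadWrite.WRITE);
            try {
                // A pattern delete; the WHERE bindings are what get spilled.
                UpdateAction.parseExecute(
                    "DELETE { ?s ?p ?o } WHERE { ?s ?p ?o }", dataset);
                dataset.commit();
            } finally {
                dataset.end();
            }
        }
    }

But as you say, for a wildcard delete like that, the new
deleteAny/removeWorker path should win because it never leaves NodeId
space.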
>
>>> If there were a spill cache for BlockMgrJournal that would be a
>>> great thing to have.  It's a much more direct way to get scalable
>>> transactions and works without a DB format change.
>>>
>>
>> Agreed.  Unfortunately the *DataBag classes require all data to be
>> written before any reading occurs, which makes them inappropriate.
>> Can't we just use another disk-backed B+Tree as a temporary store
>> here instead of the in-memory HashMap?
>
>
> If we used some other B+Tree implementation, then maybe.
>
> FYI: This is new ...
> http://directory.apache.org/mavibot/vision.html
>
> solving a different B+Tree issue.
>
> The requirement is a persistent hash map from block id (8 bytes) to
> block bytes (up to 8K of data).  This is happening underneath the index
> B+Trees during a transaction.
>
> TDB B+Trees have fixed-size keys and values and are not designed for
> storing 8K blocks (there is no very-large-value support -- they use 8K
> blocks themselves, so they can't store an 8K block once a few bytes of
> overhead are added).  They are optimized for range scans.
>

Ah, of course, the B+Trees can't really store blocks bigger than their
fixed block size.

> There is an external hash table in TDB as well, but again, not designed
> for storing 8K units.
>
> In fact, a regular filesystem directory and one file per 8K spilled
> block would be a good mock-up.  That would utilize the OS file caching,
> there is no need to sync the files, and it is quite easy to do and
> debug.

Do we even need to separate it out into files?  How about a single
memory-mapped file with an in-memory index (rough sketch in the P.S.
below)?  It might be faster and would use fewer resources, such as file
handles.

> An advantage of early spill-to-journal, and an in-memory tombstone
> (~10 bytes) (or a B+Tree of tombstones), is that this can be the final
> write of the data to the journal.  An off-journal temporary store means
> the data has to be written to the off-journal store, then read in and
> written to the journal.  That's extra I/O, probably real I/O as well,
> with disk head movement (read from one place, write to another).
>
> The only oddity is that the journal is append-only (it does not have to
> be within the current uncommitted transaction, but append-only files
> are faster than random-access files).  If a block gets spilled, is then
> updated, and then spilled again, we assume that is sufficiently rare
> that appending a second, newer copy to the journal to overwrite the
> first on playback is acceptable.  Not perfect, but writing in place to
> the active journal area may cost more (it's more likely to need a disk
> seek).
>
> (Yes, the journal on an SSD is a good idea.)
>

Agreed.  Spilling directly to the journal is much better than some other
temporary store if we can do it.

-Stephen
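P.S. In case it's useful, here is very roughly what I was picturing for
the single-file spill store, if we do end up needing an off-journal
store.  It is only a sketch in plain Java -- none of this is existing TDB
code, the class and method names are made up, and it has a fixed capacity
with no eviction -- just to show the memory-mapped file plus in-memory
index idea.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.HashMap;
    import java.util.Map;

    /** Hypothetical spill store: one memory-mapped file, block id -> 8K block. */
    public class MappedSpillFile {
        private static final int BLOCK_SIZE = 8 * 1024;

        private final MappedByteBuffer buffer;
        // In-memory index: block id -> slot number in the mapped file.
        private final Map<Long, Integer> index = new HashMap<Long, Integer>();
        private int nextSlot = 0;

        public MappedSpillFile(String path, int maxBlocks) throws IOException {
            RandomAccessFile file = new RandomAccessFile(path, "rw");
            try {
                buffer = file.getChannel().map(
                    FileChannel.MapMode.READ_WRITE, 0, (long) maxBlocks * BLOCK_SIZE);
            } finally {
                file.close();   // the mapping stays valid after the file is closed
            }
        }

        /** Write (or overwrite) the spilled copy of a block. */
        public synchronized void put(long blockId, byte[] block) {
            Integer slot = index.get(blockId);
            if (slot == null) {
                slot = nextSlot++;
                index.put(blockId, slot);
            }
            buffer.position(slot * BLOCK_SIZE);
            buffer.put(block, 0, Math.min(block.length, BLOCK_SIZE));
        }

        /** Read back a spilled block, or return null if it was never spilled. */
        public synchronized byte[] get(long blockId) {
            Integer slot = index.get(blockId);
            if (slot == null)
                return null;
            byte[] copy = new byte[BLOCK_SIZE];
            buffer.position(slot * BLOCK_SIZE);
            buffer.get(copy);
            return copy;
        }
    }

Compared with one file per block it keeps the OS page-cache benefits but
needs only a single file handle.  Still, spilling straight into the
journal with tombstones in memory avoids the extra read-and-rewrite
entirely, so that remains the better target.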
