On 15/08/13 10:21, Knut-Olav Hoven wrote:
Hi!

Hi there - thanks for the detailed report.


Two issues, both related to memory usage: import and delete of large graphs.

I am currently doing some tests with a 128MB heap and a little over 1M
tuples.
I know I can throw a lot of memory at the problem, but sooner or later I
will run out.

There are some fixed-size caches (as you've discovered) - 128M is likely to be too small for them.

I've noticed that TDB pulls the complete result set into memory when calling
"DatasetGraphTDB.deleteAny" before looping over it to delete the matches.
This is a problem for very large graphs if I try to delete the entire
graph or a large selection.

There is supposed to be a specific implementation for deleteAny along the lines of GraphTDB.removeWorker. But there isn't. Actually, I don't see why GraphTDB.removeWorker needs to exist if a proper DatasetGraphTDB.deleteAny existed.

Recorded as JENA-513.

I'll sort this out by moving GraphTDB.removeWorker to DatasetGraphTDB and using it for deleteAny(...) and from GraphTDB.remove.

The GraphTDB.removeWorker code gets batches of 1000 items, deletes them, and tries again until nothing more matches the delete pattern. Deletes are not done via an iterator.
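
In outline it is just a loop; this is a sketch of the pattern, not the actual removeWorker code (Graph/Triple here are the plain Jena graph-level API):

    import java.util.ArrayList ;
    import java.util.List ;

    import com.hp.hpl.jena.graph.Graph ;
    import com.hp.hpl.jena.graph.Node ;
    import com.hp.hpl.jena.graph.Triple ;
    import com.hp.hpl.jena.util.iterator.ExtendedIterator ;

    // Sketch of a batched delete: materialize up to SLICE matches, delete them,
    // and go round again until nothing matches any more. No full materialization
    // of the result set, and no deleting through a live iterator.
    public class BatchedDelete {
        private static final int SLICE = 1000 ;

        public static void removeAll(Graph graph, Node s, Node p, Node o) {
            List<Triple> batch = new ArrayList<Triple>(SLICE) ;
            while (true) {
                batch.clear() ;
                ExtendedIterator<Triple> iter = graph.find(s, p, o) ;
                try {
                    for ( int i = 0 ; i < SLICE && iter.hasNext() ; i++ )
                        batch.add(iter.next()) ;
                } finally { iter.close() ; }
                if ( batch.isEmpty() )
                    return ;
                for ( Triple t : batch )
                    graph.delete(t) ;
            }
        }
    }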

That said, having iterator remove support in RecordRangeIterator and in TupleTable would be excellent regardless of this. When I went looking for BTree code originally, I found various possibilities, but all were too closely tied to their usage to be reusable. We could pull the B+Tree code out into a reusable module.

There are some RecordRangeIterator cases that will not work with Iterator.remove() ... for example, when the B+Tree is not on the same machine as the TupleIndex client.
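
For the local case, the general shape of an iterator-backed remove is simple enough. A sketch only - the class and interface names below are illustrative, not the actual TDB internals:

    import java.util.Collection ;
    import java.util.Iterator ;

    // Placeholder for "something an item can be deleted from" - not a TDB class.
    interface Index<T> { void delete(T item) ; }

    // Sketch: an iterator wrapper whose remove() deletes the last item returned
    // from every index, not just the index being scanned.
    class AllIndexesIterator<T> implements Iterator<T> {
        private final Iterator<T> scan ;              // scan of the primary index
        private final Collection<Index<T>> indexes ;  // every index the item lives in
        private T slot = null ;

        AllIndexesIterator(Iterator<T> scan, Collection<Index<T>> indexes) {
            this.scan = scan ;
            this.indexes = indexes ;
        }

        @Override public boolean hasNext() { return scan.hasNext() ; }

        @Override public T next() { slot = scan.next() ; return slot ; }

        @Override public void remove() {
            if ( slot == null )
                throw new IllegalStateException("remove() called before next()") ;
            for ( Index<T> idx : indexes )
                idx.delete(slot) ;   // must not disturb the ongoing scan of the primary index
            slot = null ;
        }
    }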

I figured out a way to make the iterators backed by the indexes/nodes and can
now delete each item directly from the iterator. I just hope I have covered all
cases by implementing remove() in RecordRangeIterator and in TupleTable
(connected to all indexes). This was the "easy" part.

The difficult part is the Transaction and Journal, which don't write to
the journal until the transaction is just about to be committed. This
means that many Block objects are kept in memory in the HashMap
"BlockMgrJournal.writeBlocks".

Yes - this is a limitation of the current transaction system. The blocks may still be accessed, so they can't be written to the journal and forgotten. There could be a cache that knows where each block is in the journal and fetches it back (a minor point, but then the journal is jumbled; if it were in numerical block order, the writes for flushing back to the disk would likely be more efficient).

My very long-term approach would be to use immutable B+Trees where the blocks on the path to the root are copied when a block first changes. This means that transactional data is written once, during the write transaction. Commit means switching to the new root for all subsequent transactions. Old trees remain. The hard part is that the tree needs to be garbage collected; typically this is done by a background task writing a new copy. cf. CouchDB, BDB-JE (?) and Mulgara (not B+Trees, but the same approach), amongst others.
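
To illustrate the idea with a toy structure (a plain binary tree, not the TDB B+Tree; names are only for illustration): an update copies the nodes on the path from the change up to the root, and commit is nothing more than switching the root reference.

    import java.util.concurrent.atomic.AtomicReference ;

    // Toy copy-on-write tree: a (single) writer copies only the nodes on the
    // path to the root; readers work from whatever root they picked up and are
    // never disturbed. Garbage collection of old versions is the part left out.
    final class CowTree {
        static final class Node {
            final int key ; final String value ; final Node left, right ;
            Node(int key, String value, Node left, Node right) {
                this.key = key ; this.value = value ; this.left = left ; this.right = right ;
            }
        }

        private final AtomicReference<Node> root = new AtomicReference<Node>(null) ;

        // A reader takes a snapshot simply by reading the current root.
        Node snapshot() { return root.get() ; }

        // The writer builds a new version by path copying, then "commits" by
        // publishing the new root.
        void put(int key, String value) {
            root.set(insert(root.get(), key, value)) ;
        }

        private static Node insert(Node n, int key, String value) {
            if ( n == null )    return new Node(key, value, null, null) ;
            if ( key < n.key )  return new Node(n.key, n.value, insert(n.left, key, value), n.right) ;
            if ( key > n.key )  return new Node(n.key, n.value, n.left, insert(n.right, key, value)) ;
            return new Node(key, value, n.left, n.right) ;   // same key: share both subtrees
        }
    }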

This is a not insignificant rewrite of the B+Tree and BlockMgr code.

If there were a spill cache for BlockMgrJournal, that would be a great thing to have. It's a much more direct way to get scalable transactions, and it works without a DB format change.
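
Roughly: keep the most recently used blocks in memory and, past a limit, push the overflow to a temporary spill file, remembering where each block went so it can be fetched back on demand. A sketch, with made-up class and field names (not the real BlockMgrJournal), heap-backed ByteBuffers and single-threaded use assumed:

    import java.io.IOException ;
    import java.io.RandomAccessFile ;
    import java.nio.ByteBuffer ;
    import java.util.HashMap ;
    import java.util.LinkedHashMap ;
    import java.util.Map ;

    // Sketch of a spill cache for journalled blocks: an LRU of in-memory blocks;
    // overflow goes to a spill file and is read back when accessed again.
    class SpillBlockCache {
        private final int limit ;
        private final RandomAccessFile spillFile ;
        private final Map<Long, Long> spilled = new HashMap<Long, Long>() ;   // block id -> offset in spill file
        private final LinkedHashMap<Long, ByteBuffer> inMemory =
            new LinkedHashMap<Long, ByteBuffer>(16, 0.75f, true) ;            // access-ordered

        SpillBlockCache(int limit, RandomAccessFile spillFile) {
            this.limit = limit ;
            this.spillFile = spillFile ;
        }

        void put(long id, ByteBuffer block) throws IOException {
            inMemory.put(id, block) ;
            while ( inMemory.size() > limit ) {
                Map.Entry<Long, ByteBuffer> eldest = inMemory.entrySet().iterator().next() ;
                spillOut(eldest.getKey(), eldest.getValue()) ;
                inMemory.remove(eldest.getKey()) ;
            }
        }

        ByteBuffer get(long id) throws IOException {
            ByteBuffer b = inMemory.get(id) ;
            return ( b != null ) ? b : readBack(id) ;
        }

        private void spillOut(long id, ByteBuffer block) throws IOException {
            long offset = spillFile.length() ;
            spillFile.seek(offset) ;
            spillFile.writeInt(block.remaining()) ;
            spillFile.write(block.array(), block.arrayOffset() + block.position(), block.remaining()) ;
            spilled.put(id, offset) ;
        }

        private ByteBuffer readBack(long id) throws IOException {
            Long offset = spilled.get(id) ;
            if ( offset == null )
                return null ;
            spillFile.seek(offset) ;
            byte[] bytes = new byte[spillFile.readInt()] ;
            spillFile.readFully(bytes) ;
            return ByteBuffer.wrap(bytes) ;
        }
    }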

Trying to fix this by just writing to the journal directly results in
another issue in all those unit tests that open multiple transactions. The
problem is that the journal is not replayed onto the database files if
there are any transactions open. The reason BlockMgrJournal works
in those tests is that the writeBlocks HashMap is never cleared after the
transaction (so the other transactions hit that map instead of the backing
files).

I also encountered a case during import that led to a corrupt database that
I could not recover. I always got an exception from "ObjectFileStorage.read"
telling me that I had an "Impossibly large object".

Those cases always started with an OutOfMemoryError during import while
writing to the database files. By lowering the Node2NodeIdCacheSize and
NodeId2NodeCacheSize caches and splitting the import files into smaller
batches/transactions, it went fine. It seems to recover if I just return an
empty ByteBuffer instead of throwing the exception, but I guess that would
just cover up a bad state. Maybe some optimization could be done to the part
where the journal is spooled onto the database files to avoid the
OutOfMemoryError altogether and so avoid corrupt databases.
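
(A minimal sketch of the chunked, per-transaction loading described above; the TDB location and chunk file names are just placeholders.)

    import java.io.FileInputStream ;
    import java.io.InputStream ;

    import com.hp.hpl.jena.query.Dataset ;
    import com.hp.hpl.jena.query.ReadWrite ;
    import com.hp.hpl.jena.tdb.TDBFactory ;

    // One write transaction per chunk file, so the journal only ever holds one
    // chunk's worth of changed blocks.
    public class ChunkedLoad {
        public static void main(String... chunkFiles) throws Exception {
            Dataset ds = TDBFactory.createDataset("DB") ;          // placeholder location
            for ( String chunk : chunkFiles ) {                    // e.g. part-000.nt, part-001.nt, ...
                ds.begin(ReadWrite.WRITE) ;
                InputStream in = new FileInputStream(chunk) ;
                try {
                    ds.getDefaultModel().read(in, null, "N-TRIPLES") ;
                    ds.commit() ;
                } finally {
                    ds.end() ;
                    in.close() ;
                }
            }
        }
    }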

Sorry - if "Impossibly large object" happens, the database is unrecoverable. The problem happened at write time - it's just detected at read time.

Should I open some issues in Jira?

Please do.

I can provide some patches for the iterators' remove() functions.

Awesome.



Sincerely,

Knut-Olav Hoven
NRK, Norwegian Broadcasting Corporation


        Andy

