On 15/08/13 10:21, Knut-Olav Hoven wrote:
Hi!
Hi there - thanks for the detailed report.
Two issues, both related to memory usage: import and delete of large graphs.
I am currently doing some tests with a 128MB heap and a little over 1M
tuples.
I know I can throw a lot of memory at the problem, but sooner or later I
will run out.
There are some fixed-size caches (as you've discovered) - 128M is likely
to be too small for them.
I've noticed that TDB takes the complete result set into memory when calling
"DatasetGraphTDB.deleteAny" before looping over all of the results to delete them.
This is a problem for very large graphs if I try to delete the entire
graph or a large selection.
There is supposed to be a specific implementation of deleteAny, something like
GraphTDB.removeWorker. But there isn't one. Actually, I don't see why
GraphTDB.removeWorker needs to exist if a proper
DatasetGraphTDB.deleteAny existed.
Recorded as JENA-513.
I'll sort this out by moving GraphTDB.removeWorker to
DatasetGraphTDB and using it for deleteAny(...) and from GraphTDB.remove.
The GraphTDB.removeWorker code gets batches of 1000 items, deletes them
and tries again until there is nothing more matching the delete pattern.
Deletes are not done by iterator.
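In outline, the pattern is the sketch below; it uses the current
org.apache.jena package names, and the class name and BATCH_SIZE constant
are just for illustration, not the actual TDB code:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.jena.graph.Graph;
    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.util.iterator.ExtendedIterator;

    public class BatchDelete {
        private static final int BATCH_SIZE = 1000;

        /** Delete every triple matching (s,p,o), working in fixed-size batches. */
        public static void deleteAny(Graph graph, Node s, Node p, Node o) {
            List<Triple> batch = new ArrayList<>(BATCH_SIZE);
            while (true) {
                // Materialize at most BATCH_SIZE matches and close the iterator
                // before deleting, so the index is never modified under a live scan.
                ExtendedIterator<Triple> it = graph.find(s, p, o);
                try {
                    for (int i = 0; i < BATCH_SIZE && it.hasNext(); i++)
                        batch.add(it.next());
                } finally {
                    it.close();
                }
                if (batch.isEmpty())
                    return;                        // nothing left matching the pattern
                for (Triple t : batch)
                    graph.delete(t);
                batch.clear();
            }
        }
    }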
That said, having iterator remove() support in RecordRangeIterator
and in TupleTable would be excellent regardless of this. When I went
looking for BTree code originally, I found various possibilities, but all were
too closely tied to their usage to be reusable. We could pull out the
B+Tree code into a reusable module.
There are some RecordRangeIterator cases that will not work
with Iterator.remove() ... for example, when the B+Tree is not on the same
machine as the TupleIndex client.
I figured out a way to make the iterators backed by the indexes/nodes, so I can
now delete each item directly from the iterator. I just hope I have covered all
the cases by implementing remove() in RecordRangeIterator and in TupleTable
(connected to all indexes). This was the "easy" part.
The difficult part is the Transaction and Journal, which do not write to
the journal until the transaction is just about to be committed. This
means that many Block objects are kept in memory in the HashMap
"BlockMgrJournal.writeBlocks".
Yes - this is a limitation of the current transaction system. The
blocks may still be accessed, so they can't be written to the journal and
forgotten. There could be a cache that knows where each block is in the
journal and fetches it back (a minor point, but then the journal is jumbled;
if it is written in numerical block order, the writes for flushing back to
the disk are likely more efficient).
My very long term approach would be to use immutable B+Trees, where the
blocks from the changed block up to the root are copied when a block first
changes. This means that transactional data is written once, during the write
transaction. Commit means switching to the new root for all subsequent
transactions. Old trees remain. The hard part is that the old trees need to be
garbage collected. Typically, this is done by a background task writing
a new copy. cf. CouchDB, BDB-JE (?) and Mulgara (not B+Trees but the same
approach), amongst others.
This is a not insignificant rewrite of the B+Tree and BlockMgr code.
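To illustrate the path-copying idea with something far simpler than a B+Tree,
here is a tiny persistent binary search tree (a plain illustration, not TDB
code): every insert copies only the nodes on the path from the root down to
the change and returns a new root, so readers of the old root are never
disturbed and "commit" is just publishing the new root.

    final class PersistentBST {
        static final class Node {
            final int key;
            final Node left, right;
            Node(int key, Node left, Node right) { this.key = key; this.left = left; this.right = right; }
        }

        /** Insert a key, copying only the nodes on the path to it; shares the rest. */
        static Node insert(Node root, int key) {
            if (root == null) return new Node(key, null, null);
            if (key < root.key) return new Node(root.key, insert(root.left, key), root.right);
            if (key > root.key) return new Node(root.key, root.left, insert(root.right, key));
            return root;   // key already present; share the whole subtree
        }

        public static void main(String[] args) {
            Node v1 = insert(insert(insert(null, 5), 2), 8);
            Node v2 = insert(v1, 3);            // v1 is still a complete, valid tree
            System.out.println(v1 != v2);       // true: two roots, shared unchanged nodes
        }
    }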
If there were a spill cache for BlockMgrJournal, that would be a great
thing to have. It's a much more direct way to get scalable transactions,
and it works without a DB format change.
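Roughly, such a spill cache could look like the sketch below; the names, the
fixed block size and the LRU policy are all assumptions for illustration, not
the BlockMgrJournal API. Blocks beyond a memory budget are pushed to a
temporary file and read back on demand:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class BlockSpillCache implements AutoCloseable {
        private final int blockSize;
        private final int maxInMemory;
        // access-ordered map: the first entry is the least recently used block
        private final LinkedHashMap<Long, byte[]> inMemory = new LinkedHashMap<>(16, 0.75f, true);
        private final Map<Long, Long> spilled = new HashMap<>();   // block id -> file offset
        private final RandomAccessFile spillFile;
        private long nextOffset = 0;

        BlockSpillCache(File file, int blockSize, int maxInMemory) throws IOException {
            this.blockSize = blockSize;
            this.maxInMemory = maxInMemory;
            this.spillFile = new RandomAccessFile(file, "rw");
        }

        /** Register a (possibly dirty) block; may spill the least recently used one. */
        void put(long blockId, byte[] block) throws IOException {
            inMemory.put(blockId, block);          // block.length is assumed == blockSize
            if (inMemory.size() > maxInMemory) {
                Iterator<Map.Entry<Long, byte[]>> it = inMemory.entrySet().iterator();
                Map.Entry<Long, byte[]> eldest = it.next();
                it.remove();
                long offset = spilled.computeIfAbsent(eldest.getKey(), id -> allocate());
                spillFile.seek(offset);
                spillFile.write(eldest.getValue(), 0, blockSize);
            }
        }

        /** Fetch a block, from memory if possible, otherwise from the spill file. */
        byte[] get(long blockId) throws IOException {
            byte[] block = inMemory.get(blockId);
            if (block != null)
                return block;
            Long offset = spilled.get(blockId);
            if (offset == null)
                return null;                       // block was never registered
            byte[] data = new byte[blockSize];
            spillFile.seek(offset);
            spillFile.readFully(data);
            return data;
        }

        private long allocate() { long o = nextOffset; nextOffset += blockSize; return o; }

        @Override public void close() throws IOException { spillFile.close(); }
    }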
Trying to fix this by writing to the journal directly runs into
another issue in all those unit tests that open multiple transactions. The
problem is that the journal is not replayed onto the database files if
any transactions are open. The reason BlockMgrJournal works
in those tests is that the writeBlocks HashMap is never cleared after a
transaction (so the other transactions hit that map instead of the backing
files).
I also encountered a case during import that led to a corrupt database that
I could not recover. I always got an exception from "ObjectFileStorage.read"
telling me that I had an "Impossibly large object".
Those cases always started with an OutOfMemoryError during import while
writing to the database files. By lowering the caches Node2NodeIdCacheSize
and NodeId2NodeCacheSize and splitting the import files into smaller
batches/transactions, it went fine. It seems to recover by just returning an
empty ByteBuffer instead of throwing the exception, but I guess that would
just cover up a bad state. Maybe some optimization can be done to the part
where the journal is spooled onto the database files, to avoid the
OutOfMemoryError issue altogether and so avoid corrupt databases.
Sorry - if "Impossibly large object" happens the database is
unrecoverable. The problem happened at write time - it's just detected
at read time.
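For reference, the chunked-import workaround can be as small as one write
transaction per input chunk; a minimal sketch with placeholder file names and
database location, using the current org.apache.jena package names:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    public class BatchedImport {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("/path/to/tdb");                 // placeholder
            List<String> chunks = Arrays.asList("part-1.nt", "part-2.nt", "part-3.nt"); // placeholders

            // One write transaction per chunk keeps the journal (and the heap) small.
            for (String chunk : chunks) {
                dataset.begin(ReadWrite.WRITE);
                try {
                    RDFDataMgr.read(dataset.getDefaultModel(), chunk);
                    dataset.commit();
                } finally {
                    dataset.end();
                }
            }
            dataset.close();
        }
    }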
Should I open some issues in Jira?
Please do.
I can provide some patches for the iterators' remove() functions.
Awesome.
Sincerely,
Knut-Olav Hoven
NRK, Norwegian Broadcasting Corporation
Andy