On 11/10/13 08:52, Daniel Gerber wrote:
On 10.10.2013, at 12:37, Andy Seaborne <[email protected]> wrote:
On 10/10/13 10:37, Daniel Gerber wrote:
Hi,
I'm importing 20 MB of data every day into a Jena TDB store.
Before insertion, I delete everything (model.removeAll()). But I
noticed that the size of the index does not shrink; it even increases
every day (it is now at 11 GB and will soon hit physical limits). I
found this question [1] on Stack Overflow but could not find any
mailing list entry (so sorry for re-asking this question). Is there
any way, other than deletion, to reduce the size of a Jena TDB
directory/index?
Cheers, Daniel
[1]
http://stackoverflow.com/questions/11088082/how-to-reduce-the-size-of-the-tdb-backed-jena-dataset
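A minimal sketch of the daily cycle described above, assuming a TDB-backed
dataset and current Apache Jena package names; the database directory and
input file name are made up:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    public class DailyReload {
        public static void main(String[] args) {
            // Hypothetical database location.
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            Model model = dataset.getDefaultModel();
            model.removeAll();                          // delete yesterday's data
            RDFDataMgr.read(model, "daily-dump.ttl");   // hypothetical input file
            dataset.close();
        }
    }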
Daniel,
Your question is a good one - the full answer depends on the details of your
setup, though.
The indexes won't shrink - TDB never gives disk space back to the OS - but disk space is
reused when it is freed and reallocated within the same JVM. If you are deleting, stopping,
and restarting (hence different JVMs), then there can be disk-space leaks, but it sounds
like that is not the case here, because the "leak" in that case can be most of the database
and you'd notice!
The other issue is blank nodes - does your data have a significant number of
blank nodes? If so, each load creates new blank nodes. Nodes are not
garbage-collected, so old blank nodes (and unused URIs and literals) remain in
the node table.
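To illustrate the blank node point: every parse of the same document mints
fresh blank nodes, so each reload adds new node-table entries even though the
data looks identical. A self-contained sketch (the ex: URIs are made up):

    import java.io.StringReader;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;

    public class FreshBlankNodes {
        public static void main(String[] args) {
            String ttl = "@prefix ex: <http://example/> . [] ex:p ex:o .";
            Resource b1 = subjectOf(ttl);
            Resource b2 = subjectOf(ttl);
            // Two parses of the same text yield different blank node ids,
            // i.e. two distinct node-table entries in a persistent store.
            System.out.println(b1.getId() + " vs " + b2.getId());
            System.out.println("same node? " + b1.equals(b2));  // false
        }

        static Resource subjectOf(String ttl) {
            Model m = ModelFactory.createDefaultModel();
            m.read(new StringReader(ttl), null, "TTL");
            return m.listSubjects().next();
        }
    }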
Hi Andy,
Thanks for your insights.
Well, yes, I do have blank nodes. So there is no way of manually cleaning up the
node table? I wonder how this can be expected behavior. Who wants to run a
database which grows by hundreds of MBs every day (while importing only 200k triples)?
If you are clearing out an entire database, then closing the database (and
removing it from the StoreConnection manager), deleting the files, and then
reloading - possibly with the bulk loader - may work for you.
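A sketch of that clear-and-reload sequence, assuming current Apache Jena
package names; the database directory and input file are hypothetical:

    import org.apache.jena.atlas.lib.FileOps;
    import org.apache.jena.query.Dataset;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.StoreConnection;
    import org.apache.jena.tdb.TDBFactory;
    import org.apache.jena.tdb.base.file.Location;

    public class ClearAndReload {
        public static void main(String[] args) {
            String dir = "/data/tdb";                   // hypothetical directory
            Location location = Location.create(dir);

            Dataset dataset = TDBFactory.createDataset(location);
            dataset.close();                            // close the database ...
            StoreConnection.release(location);          // ... and drop it from the StoreConnection manager
            FileOps.clearDirectory(dir);                // delete the index and node-table files

            // Reload from scratch (the command-line bulk loader, tdbloader,
            // is an alternative for large inputs).
            Dataset fresh = TDBFactory.createDataset(location);
            RDFDataMgr.read(fresh, "daily-dump.ttl");   // hypothetical input file
            fresh.close();
        }
    }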
Well, I can't simply delete everything, since I have different graphs inside
this directory.
Do you see any chance of fixing the issue?
Cheers,
Daniel
Daniel,
If it's hundreds of MBs for 200k triples, then it is not the node table
growing with blank nodes.
Are there large literals in the data?
Which of the files in the database directory are growing most?
Do you do this as separate processes (separate JVMs, restart between
delete and load)?
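One quick way to see which files grow is to list the database directory by
size each day; a plain-Java sketch with a hypothetical path:

    import java.io.File;
    import java.util.Arrays;
    import java.util.Comparator;

    public class DbFileSizes {
        public static void main(String[] args) {
            // Hypothetical database directory.
            File[] files = new File("/data/tdb").listFiles();
            Arrays.sort(files, Comparator.comparingLong(File::length).reversed());
            for (File f : files) {
                System.out.printf("%12d  %s%n", f.length(), f.getName());
            }
        }
    }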
Andy
BTW, most systems don't GC their node table - it's a tradeoff, and
efficiency generally wins. GC would otherwise require reference
counting.