On 11/10/13 08:52, Daniel Gerber wrote:

On 10.10.2013, at 12:37, Andy Seaborne <[email protected]> wrote:

On 10/10/13 10:37, Daniel Gerber wrote:
Hi,
I'm importing 20 MB of data every day into a Jena TDB store. Before
insertion, I delete everything (model.removeAll()). But I noticed that
the size of the index does not shrink; it even increases every day
(it's now at 11 GB and will soon hit physical limits). I found this
question [1] on Stack Overflow but could not find any mailing list
entry (so sorry for re-asking). Is there any way, other than deleting
it, to reduce the size of a Jena TDB directory/index?

Cheers, Daniel

[1]
http://stackoverflow.com/questions/11088082/how-to-reduce-the-size-of-the-tdb-backed-jena-dataset


Daniel,

Your question is a good one - the full answer depends on the details of your
setup, though.

The indexes won't shrink - TDB never gives disk space back to the OS - but disk
space is reused when reallocated within the same JVM.  If you are deleting,
stopping, and restarting (hence different JVMs), then there can be space leaks,
but it sounds like that is not the case here: the "leak" in that case can be
most of the database, and you'd notice!

The other issue is blank nodes - does your data have a significant number of
blank nodes?  If so, each load is creating new blank nodes. Nodes are not
garbage collected, so old blank nodes (and unused URIs and literals) remain in
the node table.
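For example (a minimal illustration, not TDB-specific: parsing the same
document twice mints distinct blank nodes, so the old ones stay behind
in the node table):

    import java.io.StringReader;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;

    public class BlankNodeDemo {
        public static void main(String[] args) {
            String ttl = "_:b <http://example/p> <http://example/o> .";
            Model m1 = ModelFactory.createDefaultModel();
            m1.read(new StringReader(ttl), null, "TTL");
            Model m2 = ModelFactory.createDefaultModel();
            m2.read(new StringReader(ttl), null, "TTL");
            Resource b1 = m1.listSubjects().nextResource();
            Resource b2 = m2.listSubjects().nextResource();
            // prints false: each parse creates a fresh blank node
            System.out.println(b1.equals(b2));
        }
    }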

Hi Andy,
Thanks for your insights.
Well, yes, I do have blank nodes. So there is no way of manually cleaning up
the node table? I wonder how this can be expected behavior. Who wants to run a
database which grows by hundreds of MBs every day (while importing only 200k
triples)?

If you are clearing out an entire database, then closing the database (and
removing it from the StoreConnection manager), deleting the files, and then
reloading - possibly with the bulk loader - may work for you.
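In code, that cycle might look roughly like this (a sketch against the
Jena 2.x-era TDB API; the path is illustrative, and the
StoreConnection.release / Location calls are the points to check
against your TDB version):

    import java.io.File;
    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.tdb.StoreConnection;
    import com.hp.hpl.jena.tdb.TDBFactory;
    import com.hp.hpl.jena.tdb.base.file.Location;

    public class RebuildStore {
        public static void main(String[] args) {
            String dir = "/data/tdb";                   // illustrative path
            Dataset dataset = TDBFactory.createDataset(dir);
            // ... all other users of the dataset must be finished ...
            dataset.close();
            StoreConnection.release(new Location(dir)); // drop cached connection
            for (File f : new File(dir).listFiles())    // delete indexes + node table
                f.delete();
            Dataset fresh = TDBFactory.createDataset(dir); // empty store
            // reload here, e.g. with the tdbloader script or read(...)
            fresh.getDefaultModel().read("file:/data/today.nt", "N-TRIPLES");
            fresh.close();
        }
    }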

Well, I can't simply delete everything, since I have different graphs inside
this directory.
Do you see any chance to fix the issue?

Cheers,
Daniel

Daniel,

If it's hundreds of MBs for 200k triples, then it is not the node table growing with blank nodes.

Are there large literals in the data?

Which of the files in the database directory are growing most? (A quick way to check is sketched below.)

Do you do this as separate processes (separate JVMs, restart between delete and load)?
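Something like this will show where the space is going (just a
directory listing sorted by size; the path is illustrative):

    import java.io.File;
    import java.util.Arrays;
    import java.util.Comparator;

    public class TdbFileSizes {
        public static void main(String[] args) {
            File[] files = new File("/data/tdb").listFiles();
            Arrays.sort(files, new Comparator<File>() {
                public int compare(File a, File b) {
                    return Long.compare(b.length(), a.length()); // biggest first
                }
            });
            for (File f : files)
                System.out.printf("%,15d  %s%n", f.length(), f.getName());
        }
    }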

        Andy

BTW, most systems don't GC their node table - it's a tradeoff, and efficiency generally wins. GC of the node table would otherwise require reference counting.
