afs commented on code in PR #205:
URL: https://github.com/apache/jena-site/pull/205#discussion_r1949656077
##########
source/documentation/tdb/faqs.md:
##########
@@ -159,78 +145,139 @@ Fuseki the journal will be flushed to disk. When using the [TDB Java API](java_a
TDBFactory.release(dataset);
}
-<a name="ssd"></a>
-## Should I use a SSD?
+### Why is the database much larger on disk than my input data? {#input-vs-database-size}
-Yes if you are able to
+TDB2 uses copy-on-write data structures. This means that each new write transaction takes copies of any data blocks it
+modifies during the transaction and writes new copies of those blocks with the required modifications. The old blocks
+are not automatically removed as they might still be referenced by ongoing read transactions. Depending on how you've
+loaded your data into TDB2 - how many transactions were used, how large each transaction was, input data characteristics
+etc. - this can lead to much larger database disk size than your original input data size.
-Using a SSD boost performance in a number of ways. Firstly bulk loads, inserts and deletions will be faster i.e. operations that modify the
-database and have to be flushed to disk at some point due to faster IO. Secondly TDB will start faster because the files can be mapped into
-memory faster.
+You can run a [Compaction](../tdb2/tdb2_admin.md#compaction) operation on your database to have TDB2 prune the data
+structures to only preserve the current data blocks. Compactions require exclusive write access to the database i.e. no
+other read/write transactions may occur while a compaction is running. Thus, compactions should generally be run
+offline, or at quiet times if exposing your database to multiple applications per [Can I share a TDB dataset between
+multiple applications?](#multi-jvm).
-SSDs will make the most difference when performing bulk loads since the on-disk database format for TDB is entirely portable and may be
-safely copied between systems (provided there is no process accessing the database at the time). Therefore even if you can't run your production
-system with a SSD you can always perform your bulk load on a SSD equipped system first and then move the database to your production system.
+Please note that compaction creates a new `Data-NNNN` directory per [TDB2 Directory
+Layout](../tdb2/tdb2_admin.md#tdb2-directory-layout) into which it writes the compacted copy of the database. The old
+directory won't be automatically removed unless the compaction operation was explicitly configured to do so. Therefore,
+the immediate effect of a compaction may actually be more disk space usage until the old data directory can be removed.
+If the database was already maximally compacted then there will be no difference in size between the old and new data
+directories.
-<a name="lock-exception"></a>
-## Why do I get the exception *Can't open database at location /path/to/db as it is already locked by the process with PID 1234* when trying to open a TDB database?
+We would recommend that you consider running a compaction after an initial bulk data load, although some bulk loading
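
As an illustration of the compaction operation described in the added FAQ text, here is a minimal Java sketch using the TDB2 API. The database path is a placeholder, and the two-argument `DatabaseMgr.compact` call (which deletes the old `Data-NNNN` directory afterwards) is only available in more recent Jena releases; the `tdb2.tdbcompact` command-line tool performs the same operation.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.tdb2.DatabaseMgr;
import org.apache.jena.tdb2.TDB2Factory;

public class CompactionExample {
    public static void main(String[] args) {
        // Placeholder location - substitute your own TDB2 database directory.
        Dataset dataset = TDB2Factory.connectDataset("/path/to/tdb2-database");

        // Compact the database. This needs exclusive write access:
        // no other read/write transactions may run while it executes.
        // The boolean argument asks TDB2 to delete the old Data-NNNN
        // directory once the compacted copy is complete.
        DatabaseMgr.compact(dataset.asDatasetGraph(), true);
    }
}
```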
Review Comment:
All the bulk loaders (default and parallel, even basic) should do a decent job, and the resulting database is reasonably compact.
A parallel load is better than a compacted database! It writes the leaf blocks full. Compaction could be made to use this algorithm but it doesn't at the moment.
Compacting after `--loader=parallel` will expand the database.
It is loading many files in separate transactions (i.e. one file at a time, even into an empty DB) that causes an oversized database.
Named graphs take up more space than a single graph (2x the indexes, as well as quads being larger than triples).
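
To illustrate the point about transactions, a minimal sketch that loads several files inside one write transaction rather than one transaction per file; the database location and file names are placeholders.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;

public class SingleTxnLoad {
    public static void main(String[] args) {
        // Placeholder location and input files.
        Dataset dataset = TDB2Factory.connectDataset("/path/to/tdb2-database");
        String[] files = { "data1.ttl", "data2.ttl", "data3.ttl" };

        // One write transaction covering all the files, rather than one
        // transaction per file, avoids re-copying index blocks on every commit.
        Txn.executeWrite(dataset, () -> {
            for (String file : files) {
                RDFDataMgr.read(dataset, file);
            }
        });
    }
}
```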