afs commented on code in PR #205:
URL: https://github.com/apache/jena-site/pull/205#discussion_r1949656077
##########
source/documentation/tdb/faqs.md:
##########
@@ -159,78 +145,139 @@ Fuseki the journal will be flushed to disk. When using the [TDB Java API](java_a
TDBFactory.release(dataset);
}
-<a name="ssd"></a>
-## Should I use a SSD?
+### Why is the database much larger on disk than my input data? {#input-vs-database-size}
-Yes if you are able to
+TDB2 uses copy-on-write data structures. This means that each new write transaction takes copies of any data blocks it
+modifies during the transaction and writes new copies of those blocks with the required modifications. The old blocks
+are not automatically removed as they might still be referenced by ongoing read transactions. Depending on how you've
+loaded your data into TDB2 - how many transactions were used, how large each transaction was, input data characteristics
+etc. - this can lead to much larger database disk size than your original input data size.
-Using a SSD boost performance in a number of ways. Firstly bulk loads, inserts and deletions will be faster i.e. operations that modify the
-database and have to be flushed to disk at some point due to faster IO. Secondly TDB will start faster because the files can be mapped into
-memory faster.
+You can run a [Compaction](../tdb2/tdb2_admin.md#compaction) operation on your database to have TDB2 prune the data
+structures to only preserve the current data blocks. Compactions require exclusive write access to the database i.e. no
+other read/write transactions may occur while a compaction is running. Thus, compactions should generally be run
+offline, or at quiet times if exposing your database to multiple applications per [Can I share a TDB dataset between
+multiple applications?](#multi-jvm).
-SSDs will make the most difference when performing bulk loads since the on-disk database format for TDB is entirely portable and may be
-safely copied between systems (provided there is no process accessing the database at the time). Therefore even if you can't run your production
-system with a SSD you can always perform your bulk load on a SSD equipped system first and then move the database to your production system.
+Please note that compaction creates a new `Data-NNNN` directory per [TDB2 Directory
+Layout](../tdb2/tdb2_admin.md#tdb2-directory-layout) into which it writes the compacted copy of the database. The old
+directory won't be automatically removed unless the compaction operation was explicitly configured to do so. Therefore,
+the immediate effect of a compaction may actually be more disk space usage until the old data directory can be removed.
+If the database was already maximally compacted then there will be no difference in size between the old and new data
+directories.
-<a name="lock-exception"></a>
-## Why do I get the exception *Can't open database at location /path/to/db as it is already locked by the process with PID 1234* when trying to open a TDB database?
+We would recommend that you consider running a compaction after an initial bulk data load, although some bulk loading
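
As an illustration of the compaction operation described in the added FAQ text, here is a minimal Java sketch using the TDB2 API. The database path is a placeholder, and the two-argument `DatabaseMgr.compact` call (which deletes the old `Data-NNNN` directory afterwards) is only available in more recent Jena releases; the `tdb2.tdbcompact` command-line tool performs the same operation.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.tdb2.DatabaseMgr;
import org.apache.jena.tdb2.TDB2Factory;

public class CompactionExample {
    public static void main(String[] args) {
        // Placeholder location - substitute your own TDB2 database directory.
        Dataset dataset = TDB2Factory.connectDataset("/path/to/tdb2-database");

        // Compact the database. This needs exclusive write access:
        // no other read/write transactions may run while it executes.
        // The boolean argument asks TDB2 to delete the old Data-NNNN
        // directory once the compacted copy is complete.
        DatabaseMgr.compact(dataset.asDatasetGraph(), true);
    }
}
```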
Review Comment:
All the bulk loaders (default and parallel, even basic) should do a decent job, and the resulting database is reasonably compact.
A parallel load is better than a compacted database! It writes the leaf blocks full. Compaction could be made to use this algorithm but it doesn't at the moment.
Compacting after `--loader=parallel` will expand the database.
It is loading many files in separate transactions (i.e. one file at a time, even into an empty DB) that causes an oversized database.
Named graphs take up more space than a single graph (2x the indexes, as well as quads being larger than triples).
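
To illustrate the point about transactions, a minimal sketch that loads several files inside one write transaction rather than one transaction per file; the database location and file names are placeholders.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;

public class SingleTxnLoad {
    public static void main(String[] args) {
        // Placeholder location and input files.
        Dataset dataset = TDB2Factory.connectDataset("/path/to/tdb2-database");
        String[] files = { "data1.ttl", "data2.ttl", "data3.ttl" };

        // One write transaction covering all the files, rather than one
        // transaction per file, avoids re-copying index blocks on every commit.
        Txn.executeWrite(dataset, () -> {
            for (String file : files) {
                RDFDataMgr.read(dataset, file);
            }
        });
    }
}
```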