On 30/05/11 05:41, Al Baker wrote:
Hi Andy,

Great to hear about the transaction work.

For the N-Quad export on a live database, do you mean that a running
application - with open Jena models, and possibly a thread writing to it -
will not interfere with the N-Quad dump - the only gotcha would be a
possible missed quad just before/after the export operation?  If so, I think
that would suffice for the short term.

An n-quad dump is another read operation - MRSW applies.
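For concreteness, a minimal sketch of such a dump, using the current Apache Jena package names (not the 2011-era ones) and an example database directory and output file; the dump is just another reader, so under MRSW it must not overlap with a writer:

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.query.Dataset;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb.TDBFactory;

public class NQuadDump {
    public static void main(String[] args) throws Exception {
        // Connect to the existing on-disk database (example location).
        Dataset dataset = TDBFactory.createDataset("/data/tdb");

        // The dump is a read operation: under MRSW it must not run
        // concurrently with any writer in the same JVM.
        try (OutputStream out = new FileOutputStream("dump.nq")) {
            RDFDataMgr.write(out, dataset, Lang.NQUADS);
        }
    }
}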

I guess another possibility is, within the app, to fire up a thread and do a
Model.write out to the filesystem to save the entire model.

Yes - again, it's a read operation.
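A sketch of that pattern, again with current Apache Jena class names: the thread takes Jena's read lock so it behaves as one of the "multiple readers" under MRSW (the file name and error handling are just examples):

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.shared.Lock;

public class BackgroundDump {
    // Save the whole model to a file in a separate thread, as a read operation.
    static Thread saveAsync(Model model, String file) {
        Thread t = new Thread(() -> {
            model.enterCriticalSection(Lock.READ);   // plain reader under MRSW
            try (OutputStream out = new FileOutputStream(file)) {
                model.write(out, "N-TRIPLES");
            } catch (Exception e) {
                e.printStackTrace();                 // example handling only
            } finally {
                model.leaveCriticalSection();
            }
        });
        t.start();
        return t;
    }
}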

Regarding bulk imports - I'm actually finding regular Jena model
manipulation runs very fast with TDB.  Within seconds I can have a 100k-statement
TDB store set up.

Ok - if your data is 100K, then many of these issues do not apply very strongly.  At that sort of size, the data is completely cached in memory.  It's when you get into the many tens of millions of triples that things get complicated, because operations take an appreciable length of time.

I'm not looking to boil the ocean with the giant datasets out there; I'm
coming at this from a practical how-to-build-apps perspective.  Speaking of
which, it would be nice if the cache were controllable - similar to Ehcache
(or maybe an idea for a future project for TDB to use Ehcache) - TTL, max
entries in memory, etc.

EhCache is interesting (caveat: understanding its license dependencies) and certainly better cache control is desirable.

What would be good is a "scan-resistant" eviction policy (such as ARC or LRU-2) which retains very commonly used blocks even if the access pattern includes a pass through a large proportion of the database.  Currently, the B+trees have separate caches for branches and leaves, which reduces the scan problem somewhat.  The old B-Tree code was susceptible to scan access patterns making the caching inefficient.
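Purely to illustrate the idea of scan resistance (this is not TDB's cache code): a segmented LRU keeps blocks that have been hit more than once in a protected segment, so a single pass over the whole database only churns the probationary segment:

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative segmented LRU: blocks touched once sit in the probation
// segment; a second hit promotes them to the protected segment, which a
// one-off scan of the database cannot flush.
class SegmentedLruCache<K, V> {
    private final Map<K, V> probation;
    private final Map<K, V> protectedSeg;

    SegmentedLruCache(int probationSize, int protectedSize) {
        probation = lru(probationSize);
        protectedSeg = lru(protectedSize);
    }

    private static <K, V> Map<K, V> lru(int max) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, V> e) {
                return size() > max;
            }
        };
    }

    synchronized V get(K key) {
        V v = protectedSeg.get(key);
        if (v != null) return v;                  // already protected
        v = probation.remove(key);
        if (v != null) protectedSeg.put(key, v);  // second hit: promote
        return v;
    }

    synchronized void put(K key, V value) {
        if (protectedSeg.containsKey(key)) protectedSeg.put(key, value);
        else probation.put(key, value);
    }
}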

        Andy



Thanks,
Al


On Sun, May 29, 2011 at 2:52 PM, Andy Seaborne <[email protected]> wrote:



On 28/05/11 08:30, Al Baker wrote:

I've been testing TDB for a while, and am very impressed with its
performance.  However, I do see the various emails on the mailing lists
warning of touching the files while an application with TDB is open
(presumably with an open Jena Model attached to the TDB directory).

What kind of reliability does TDB have to survive a power hit or an
application crash?

Are there some steps to take consistent and regular backups to mitigate
any issues?

Basically I am looking to have some level of confidence that I can use TDB
in production, take a reasonable number of steps to ensure reliability, and
be confident that I'll always either have a valid TDB store, or a way to
incrementally back up / roll back in the case of a severe crash or file
system error.

Thanks,
Al Baker


Hi Al,

Currently, TDB provides some update capabilities but relies on the
application maintaining MRSW (Multiple Readers Or Single Writer) concurrency
semantics together with a clean shutdown.  Many of the reports are due to
letting two writers access the database at the same time, or to crashes
without ensuring a sync() has been done, which is currently important for updates.
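A sketch of that discipline, using current Apache Jena class names and an example database location: a single writer takes the write lock and forces a sync() before releasing it, so the on-disk structures are consistent when the lock is dropped:

import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.shared.Lock;
import org.apache.jena.tdb.TDB;
import org.apache.jena.tdb.TDBFactory;

public class MrswUpdate {
    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("/data/tdb");  // example location
        Model model = dataset.getDefaultModel();

        model.enterCriticalSection(Lock.WRITE);   // single writer at a time
        try {
            model.add(model.createResource("http://example/s"),
                      model.createProperty("http://example/p"),
                      "an example value");
            TDB.sync(dataset);                    // flush to disk before releasing the lock
        } finally {
            model.leaveCriticalSection();
        }
    }
}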

For read-only usage, the database is safe - it is not modified or reorganised
by reads, so loss of machines or applications does not damage the on-disk
database.

TDB is an in-process database - one JVM controls the database.  Having two
JVMs managing the files will also cause damage.

You can back up a database by copying the files, but on a running system only
if you co-ordinate with a sync(), which makes the on-disk structures
consistent.  Stopping the DB is better, and is needed on some OSes, but
dumping to N-Quads can be done on a live database.
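A sketch of the copy-based backup, on the assumption that no writer touches the database while the files are copied (paths are examples):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.jena.query.Dataset;
import org.apache.jena.tdb.TDB;
import org.apache.jena.tdb.TDBFactory;

public class BackupByCopy {
    public static void main(String[] args) throws IOException {
        Path dbDir = Paths.get("/data/tdb");        // example database directory
        Path backupDir = Paths.get("/backup/tdb");  // example backup target

        Dataset dataset = TDBFactory.createDataset(dbDir.toString());
        TDB.sync(dataset);   // make the on-disk structures consistent first

        // Copy the database files; no writer may run during the copy.
        Files.createDirectories(backupDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dbDir)) {
            for (Path f : files) {
                Files.copy(f, backupDir.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}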

For updates, there are periods of vulnerability.  This is being addressed
by adding ACID transactions to TDB.  The transaction system is based on
write-ahead logging; read requests go straight to the DB as before so
performance there will be unchanged.
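A sketch of how such a transaction API is used from application code - the begin/commit/end names here match what later Jena releases provide, and the database location is an example; none of this existed when this mail was written:

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.tdb.TDBFactory;
import org.apache.jena.vocabulary.RDFS;

public class TxnExample {
    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("/data/tdb");  // example location

        dataset.begin(ReadWrite.WRITE);   // changes are captured by the write-ahead log
        try {
            dataset.getDefaultModel()
                   .createResource("http://example/s")
                   .addProperty(RDFS.label, "example");
            dataset.commit();             // make the changes durable
        } finally {
            dataset.end();
        }
    }
}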

The disk format is (probably) going to be unchanged.  There are some
improvements that can be made but they aren't necessary.

The bulk loader used to build a database from scratch will provide the best
load performance.  It will remain non-transactional. Transactions will be
aimed at non-bulk updates.  Where the practical boundary lies will emerge
in testing.

The transaction work is active-work-in-progress [*] but I'm not going to
give specific release schedules except to say that as an open source
project, "release early, release often" of development versions will happen.

        Andy

[*] Indeed, I'm writing a journaled file abstraction at this moment.

