On 30/03/11 22:46, Stephen Allen wrote:
> Andy,
>
> As an aside, I recall you mentioning that you had a BDB version of
> TDB, using that would seem to offer a fast, stable way of adding
> transactions to your B-trees.  Out of curiosity, were there problems
> with using BDB?

https://github.com/afs/TDB-BDB

No problems as such, but it just isn't very fast (non-transactionally). There was no bulk loading advantage at all, and query performance was slower but OK. That's before turning on transactions. As the data scaled, the difference between TDB native and TDB-BDB became more pronounced.

BDB-C and BDB-JE are about the same speed.

Given they were already slower, and that for TxTDB I want to retain reader performance, that doesn't look like a good starting point.

It might be a good place for a version with different goals - less emphasis on scale, more on high-frequency writes (and fewer reads), for example a sensor data hub.

I don't know why they are slower but I speculate that the general purpose design of both BDBs (e.g. fully variable length key and value, node size, overhead in the tree blocks for all sorts of features not used) means they are optimized for something else. BDB is designed for high write concurrency - RDF datastores are for publishing (read dominant). Sometimes these design objectives pull in different directions.

I used BDB to store the string table as well (lexical forms of nodes). It was better to use a native string file.

Maybe it's a case of not using them to their best advantage.

tdbloader1 simply does the loading work in an order that is better than adding triples one at a time, indexing as you go. It loads the primary index, then builds the secondary indexes by copying from the primary. The same approach applies to BDB, but it didn't help.
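
As a rough illustration of that loading order (not the actual tdbloader1 code), here is a minimal Python sketch. The index names SPO, POS and OSP, the function name, and the use of sorted lists in place of B+Trees are all assumptions made for the example.

```python
def load_primary_then_secondary(triples):
    """Hypothetical sketch: fill the primary index first, then derive the others.

    Sorted lists stand in for the on-disk B+Trees.
    """
    # Step 1: load the primary index (assumed here to be SPO order).
    spo = sorted(set(triples))

    # Step 2: build each secondary index by copying from the primary,
    # rearranging each triple into that index's key order.
    pos = sorted((p, o, s) for s, p, o in spo)
    osp = sorted((o, s, p) for s, p, o in spo)
    return spo, pos, osp


if __name__ == "__main__":
    data = [("s2", "p1", "o1"), ("s1", "p1", "o2"), ("s1", "p1", "o2")]
    for name, index in zip(("SPO", "POS", "OSP"), load_primary_then_secondary(data)):
        print(name, index)
```

The point of the ordering is that each index is written in one pass, rather than all three being touched for every incoming triple.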

tdbloader2 uses Unix sort(1) to prepare the index data by sorting into the order for each index, then writes the B+Trees directly to disk (from the bottom up and very carefully).
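
To make the "bottom up" idea concrete, here is a minimal in-memory Python sketch, assuming the keys are already sorted. It is not the tdbloader2 implementation: the function name, the fanout parameter and the list-of-levels representation are hypothetical, and Python's sorted() stands in for sort(1).

```python
def build_bottom_up(sorted_keys, fanout=4):
    """Hypothetical sketch: build tree levels from sorted keys, leaves first.

    Returns a list of levels; each level is a list of blocks (lists of keys).
    The last level holds the root block.
    """
    # Leaf level: pack the sorted keys into full blocks, left to right.
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [level]

    # Each higher level holds one separator (the smallest key) per child block.
    while len(level) > 1:
        separators = [block[0] for block in level]
        level = [separators[i:i + fanout] for i in range(0, len(separators), fanout)]
        levels.append(level)
    return levels


if __name__ == "__main__":
    keys = sorted([7, 3, 10, 1, 5, 9, 2, 8, 4, 6])  # stand-in for sort(1)
    for depth, blocks in enumerate(build_bottom_up(keys)):
        print(depth, blocks)
```

Because the input is sorted, every block is written once and never split, which is where a builder like this would save work over inserting keys one at a time.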

        Andy
