Re: TDB: release process

Simon Helsen Wed, 18 Jan 2012 16:54:56 -0800

Andy,

it is tricky for me to provide the suite because it is embedded in a 
larger framework. Yet, the numbers are clean IMO because the times I 
provided are taken around the calls into Jena. Moreover, the absolute 
numbers don't matter very much. Some of the queries are somewhat contrived 
in their complexity and the suite was designed to be very configurable 
(making it harder to determine what the expected results have to look 
like). The difference between the 2 tests is the usage of Jena and the 
bound TDB, so whatever difference in times I see is mostly attributable to 
this, not the framework.

For us, the key is the numbers relative to vanilla TDB (up to 0.8.x). 
It is surprising that, if reads are not blocked by writers, the read 
requests take as long as they do. In the vanilla numbers, we keep track 
of how much time we "wait" and it is quite significant. I reran the tests 
where I reduced the length to get more stable numbers. Instead of 
copy-pasting the numbers, I am attaching them as images this time. 

There are 4 files, 2 for TDB and 2 for TxTDB, and for each one there is 
a pair of tables for indexing operations (using standard Jena APIs, not 
SPARQL INSERT) and a pair of tables for query operations.

I circled the relevant rows (you can ignore the 3 other rows in each 
table). I put a red box around a few relevant numbers such as total query 
time, which is important because in TDB (where we use our own locks) we 
measure wait times versus actual execution times. In TxTDB, we cannot do 
this, so the total time is more or less the total time for each type of 
query (in this case DESCRIBE and SELECT). In TDB, you can see the time 
both write and read operations have to wait for each other. In TxTDB, 
there is no such thing. For the write operations, we distinguish between 
bulk and non-bulk. During the actual scalability test, not much is bulked, 
so in practice you can add these numbers. The reason is that we start the 
test by writing out about 2000 resources (named graphs), so you'll have 
some bulking there. In TxTDB, bulking means just that we combine write 
operations in one transaction. Operations are combined in a transaction 
when they happen quickly after each other (knowing that it delays the 
visibility of data). But again, this rarely happens during the scalability 
test itself. Finally, you'll notice a slight difference in the number of 
queries executed in TDB and some additional columns (such as overtaking 
and abort and reset). This has to do with a slight variation of the 
standard exclusive write algorithm we employed. It improves the average 
query time a bit in multi-threaded tests compared to the naive exclusive 
write locking mechanism. But it should not be able to beat what you're 
trying to achieve with TxTDB.
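
For illustration, here is a minimal sketch of how wait time can be separated from execution time when the caller owns the read/write lock, as in the vanilla TDB setup described above. It is not the code from our suite: the class name, the counters and the use of a fair ReentrantReadWriteLock are assumptions made for the example, and the overtaking/abort/reset variation is not modelled.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hypothetical helper, not part of Jena/TDB: wraps each operation in a
// caller-owned read/write lock and records wait and execution time separately.
class LockTimingSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    final AtomicLong waitNanos = new AtomicLong(); // time spent blocked on the lock
    final AtomicLong execNanos = new AtomicLong(); // time spent inside the operation

    <T> T read(Supplier<T> query)   { return timed(lock.readLock(), query); }
    <T> T write(Supplier<T> update) { return timed(lock.writeLock(), update); }

    private <T> T timed(Lock l, Supplier<T> op) {
        long requested = System.nanoTime();
        l.lock();
        long acquired = System.nanoTime();
        try {
            return op.get(); // the call into the store would go here
        } finally {
            long done = System.nanoTime();
            l.unlock();
            waitNanos.addAndGet(acquired - requested);
            execNanos.addAndGet(done - acquired);
        }
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        LockTimingSketch db = new LockTimingSketch();
        CountDownLatch writerHoldsLock = new CountDownLatch(1);
        // A writer that holds the lock for a while, so the reader has to wait.
        Thread writer = new Thread(() -> db.write(() -> {
            writerHoldsLock.countDown();
            sleep(100);
            return null;
        }));
        writer.start();
        writerHoldsLock.await();
        db.read(() -> "result"); // blocked until the writer releases the lock
        writer.join();
        System.out.printf("wait=%d ms, exec=%d ms%n",
                db.waitNanos.get() / 1_000_000, db.execNanos.get() / 1_000_000);
    }
}
```

With a transactional store that manages its own concurrency, only the outer interval (requested to done) can be observed from the caller, which is why the TxTDB tables show a single total per query type.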

I think one can say with these numbers that on average TxTDB needs about 
2.5 times longer to finish a query transaction in the given test scenario 
(50 clients, 2s wait time between operations and a ratio of 7/1 
read/write). There could be a few reasons, but since transactions are more 
opaque than when using vanilla TDB, it is hard for me to tell where the 
time goes, i.e. whether there are locks inside the TxTDB code or whether 
it is genuine CPU overhead. Note that, in the numbers I provided, the 
actual parallelism in TxTDB is higher, so perhaps this slows down the 
bottom line?
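
To make the scenario concrete, here is a rough sketch of a driver with the same shape: a fixed number of clients, a pause between operations, and a 7/1 read/write mix. Everything in it is an assumption for illustration (the class name, the deterministic "every 8th operation is a write" rule, the no-op operations); the real suite is more configurable and does not necessarily schedule writes this way.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical load driver: N clients, a pause between operations, 7 reads per write.
class ScenarioSketch {
    final AtomicLong reads = new AtomicLong();
    final AtomicLong writes = new AtomicLong();
    final AtomicLong readNanos = new AtomicLong(); // total time spent in read operations

    // Assumed mix: operations 7, 15, 23, ... of each client are writes (7/1 read/write).
    static boolean isWrite(int opIndex) { return opIndex % 8 == 7; }

    void run(int clients, int opsPerClient, long pauseMillis,
             Runnable read, Runnable write) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        for (int c = 0; c < clients; c++) {
            pool.submit(() -> {
                for (int i = 0; i < opsPerClient; i++) {
                    if (isWrite(i)) {
                        write.run();
                        writes.incrementAndGet();
                    } else {
                        long t0 = System.nanoTime();
                        read.run(); // timed around the call into the store
                        readNanos.addAndGet(System.nanoTime() - t0);
                        reads.incrementAndGet();
                    }
                    try {
                        Thread.sleep(pauseMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScenarioSketch s = new ScenarioSketch();
        // Scaled down so it finishes quickly; the scenario above would be run(50, n, 2000, ...).
        s.run(5, 16, 10, () -> { }, () -> { });
        System.out.printf("reads=%d writes=%d avgReadMicros=%d%n",
                s.reads.get(), s.writes.get(),
                s.readNanos.get() / 1_000 / Math.max(1, s.reads.get()));
    }
}
```

Running the same driver against both stores and comparing the average read time is the kind of relative comparison the 2.5x figure refers to.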

To answer your original question: it would be great if you have a Jena 
test framework, but I am currently not in a position to contribute 
extensively. I think the moment we actually adopt TxTDB, this may change. 
However, I can offer to run these scalability runs locally whenever you 
have improvements or algorithmic changes and then return the results.

Simon

From:
Andy Seaborne <[email protected]>
To:
Simon Helsen/Toronto/IBM@IBMCA
Cc:
[email protected]
Date:
01/18/2012 03:22 AM
Subject:
Re: TDB: release process

On 17/01/12 22:14, Simon Helsen wrote:
> 4) I understand what your new strategy is, but could this not lead to
> starvation of read transactions?

No - a reader can't be overtaken by a later writer so no starvation.  A 
reader sees the state of the database as at the last committed write 
transaction, and does not see any changes from any later writers (the 
isolation level is "serialized" even though there is concurrency). 
Readers are not blocked by writers, unlike TDB up to 0.8.x.
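
The reader semantics described above can be illustrated with a toy model. This is only a sketch of the visible behaviour (a reader pins the last committed state and never sees later commits); it is not how TxTDB is implemented, and the class and method names are made up for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the semantics only, NOT the TxTDB implementation:
// the committed state is an immutable set that is replaced wholesale on commit.
class SnapshotSketch {
    private final AtomicReference<Set<String>> committed =
            new AtomicReference<>(Collections.<String>emptySet());

    /** Begin a read transaction: pin the state as of the last committed write. */
    Set<String> beginRead() {
        return committed.get(); // no lock taken, so writers never block readers
    }

    /** One writer at a time: copy the committed state, modify it, publish it. */
    synchronized void commitAdd(String triple) {
        Set<String> next = new HashSet<>(committed.get());
        next.add(triple);
        committed.set(Collections.unmodifiableSet(next));
    }

    public static void main(String[] args) {
        SnapshotSketch db = new SnapshotSketch();
        db.commitAdd("s p o1");
        Set<String> reader = db.beginRead(); // reader starts here
        db.commitAdd("s p o2");              // later writer commits
        System.out.println("reader sees " + reader.size() + " triple(s)");
        System.out.println("new reader sees " + db.beginRead().size() + " triple(s)");
    }
}
```

In this model a reader that started before a commit keeps working on its own snapshot, so a later writer can neither block it nor change what it sees.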

Can you provide us with something to run the tests ourselves?  You have 
said in the past it's part of a larger test framework but if it can't be 
separated out doesn't that indicate the rest of the test framework is 
caught up in the numbers?

Maybe an addition to the emerging performance framework JenaPerf [1]? 
(It's in Scala but writing "better Java" is a good way to start Scala.)

[1] https://svn.apache.org/repos/asf/incubator/jena/Experimental/JenaPerf/trunk/

                 Andy
