Thank you for the good advice!
The goal is to show that triple stores are fast enough for a linguistic
application. Five years ago a comparison was published in which a
proprietary data structure excelled; I would like to show that
triple stores are fast enough today. I can perhaps get the same dataset
and the same queries (at the application level), but I have no idea how
cache effects were accounted for; it seems that results differed between runs.
I guess I could run some warmup queries, similar to the
application queries, and then run the test queries and
compare with the previously published response times. If the response
time is of the same order of magnitude as before, that would show
that the triple store is fast enough.
Does this sound "good enough"?
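A minimal sketch of such a warmup-then-measure harness (plain Java, no Jena dependency; the query supplier is a hypothetical placeholder for the real application query, and the warmup/run counts are arbitrary):

```java
import java.util.*;
import java.util.function.Supplier;

public class QueryBenchmark {
    // Run 'warmup' untimed executions so caches are warm, then time
    // 'runs' executions and report the median, which is more stable
    // across runs than the mean.
    static long medianMillis(Supplier<?> query, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) query.get();
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            query.get();
            times[i] = (System.nanoTime() - t0) / 1_000_000;
        }
        Arrays.sort(times);
        return times[runs / 2];
    }

    public static void main(String[] args) {
        // Dummy query standing in for a real SPARQL execution.
        Supplier<Integer> dummy = () -> Integer.bitCount(123456789);
        System.out.println("median ms: " + medianMillis(dummy, 3, 7));
    }
}
```

Reporting the median rather than a single run is one simple way to reduce the run-to-run variation you mention.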
On 12/24/2017 01:24 PM, ajs6f wrote:
Any measurements would be unreliable at best and probably worthless.
1/ Different data gives different answers to queries.
2/ Caching matters a lot for databases and a different setup will cache
differently.
This is so true, and it's not even a complete list. It might be better to
approach the problem from the application layer. Are you able to put together a
good suite of test data, queries, and updates, accompanied by a good
understanding of the kinds of load the triplestore will experience in
production?
Adam Soroka
On Dec 24, 2017, at 1:21 PM, Andy Seaborne <[email protected]> wrote:
On 24/12/17 14:11, Andrew U. Frank wrote:
Thank you for the information; I take it that, using the indexes, a one-variable
query would be (close to) linear in the number of triples found. I saw that TDB
does build indexes and assumed they use hashes.
I still have the following questions:
1. is performance different for a named or the default graph?
Query performance is approximately the same for GRAPH.
Update is slower.
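Concretely, the same basic pattern against the default graph and against a named graph (the graph name here is a made-up example):

```sparql
# Default graph
SELECT ?o WHERE { <http://example.org/s> <http://example.org/p> ?o }

# Named graph -- roughly the same query cost
SELECT ?o WHERE {
  GRAPH <http://example.org/g1> {
    <http://example.org/s> <http://example.org/p> ?o
  }
}
```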
2. Can I simplify measurements by putting pieces of the dataset in different
graphs and then adding more or fewer of these graphs for each measurement? Say I have
5 named graphs, each with 10 million triples: do queries over 2, 3, 4, and 5
graphs give the same (or very similar) results as when I load 20, 30,
40, and 50 million triples into a single named graph?
Any measurements would be unreliable at best and probably worthless.
1/ Different data gives different answers to queries.
2/ Caching matters a lot for databases and a different setup will cache
differently.
Andy
Thank you for the help!
andrew
On 12/23/2017 06:20 AM, ajs6f wrote:
For example, the TIM in-memory dataset impl uses 3 indexes on triples and 6 on quads to ensure that all one-variable queries (i.e.
for triples ?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as possible. The indexes are
hashmaps (e.g. Map<Node, Map<Node, Set<Node>>>) and don't use the kind of node directory that TDB does.
There are lots of other ways to play that out, according to the balance of
times costs and storage costs desired and the expected types of queries.
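A toy version of one such hashmap index, using plain Java collections (not Jena's actual classes; strings stand in for RDF nodes):

```java
import java.util.*;

public class SpoIndex {
    // One index sketched as nested hash maps, in the shape
    // Map<Node, Map<Node, Set<Node>>> mentioned above:
    // subject -> predicate -> set of objects.
    final Map<String, Map<String, Set<String>>> spo = new HashMap<>();

    void add(String s, String p, String o) {
        spo.computeIfAbsent(s, k -> new HashMap<>())
           .computeIfAbsent(p, k -> new HashSet<>())
           .add(o);
    }

    // Answers the one-variable pattern <s> <p> ?o with two hash
    // lookups, so the cost does not grow with the total triple count.
    Set<String> objects(String s, String p) {
        return spo.getOrDefault(s, Map.of())
                  .getOrDefault(p, Set.of());
    }
}
```

The other two triple indexes (e.g. predicate-first and object-first) would have the same nested shape with the key order permuted.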
Adam
On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann
<[email protected]> wrote:
On 23.12.2017 00:47, Andrew U. Frank wrote:
Are there some rules about which queries are linear in the amount of data in
the graph? Is it correct to assume that searching for triples based
on a single condition (?p a X) is logarithmic in the size of the data
collection?
Why should it be logarithmic? The complexity of matching a single BGP
depends on the implementation. I could search for matches by doing a
scan over the whole dataset - that would certainly be linear, not
logarithmic. Usually, if one exists, a triple store would use the POS index
in order to find bindings for variable ?p.
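A toy POS index in the same spirit (an illustration with strings standing in for RDF terms, not actual triple-store code): with the predicate and object bound, the matching subjects come out of two hash lookups instead of a scan over all triples.

```java
import java.util.*;

public class PosIndex {
    // Predicate -> Object -> set of Subjects.
    final Map<String, Map<String, Set<String>>> pos = new HashMap<>();

    void add(String s, String p, String o) {
        pos.computeIfAbsent(p, k -> new HashMap<>())
           .computeIfAbsent(o, k -> new HashSet<>())
           .add(s);
    }

    // Answers the pattern (?s rdf:type X): look up p = rdf:type,
    // then o = X, and the remaining set holds the bindings for ?s.
    Set<String> subjects(String p, String o) {
        return pos.getOrDefault(p, Map.of())
                  .getOrDefault(o, Set.of());
    }
}
```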
Cheers,
Lorenz
--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
+43 1 58801 12710 direct
Geoinformation, TU Wien +43 1 58801 12700 office
Gusshausstr. 27-29 +43 1 55801 12799 fax
1040 Wien Austria +43 676 419 25 72 mobil