Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

Kingsley Idehen Wed, 24 Sep 2008 11:57:40 -0700


Paul Gearon wrote:

On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:

On 09/19/08/09/08 23:12 +0200, Orri Erling wrote:

Has there been any analysis on whether there is a *fundamental*
reason for such performance difference? Or is it simply a question of
"maturity"; in other words, relational db technology has been around for a
very long time and is very mature, whereas RDF implementations are still
quite recent, so this gap will surely narrow ...?

This is a very complex subject.  I will offer some analysis below, but
this I fear will only raise further questions.  This is not the end of the
road, far from it.

As far as I understand, another issue is relevant: this benchmark is
somewhat unfair as the relational stores have one advantage compared to the
native triple stores: the relational data structure is fixed (Products,
Producers, Reviews, etc with given columns), while the triple representation
is generic (arbitrary s,p,o).


This point has an effect on several levels.

For instance, the flexibility afforded by triples means that objects
stored in this structure require processing just to piece it all
together, whereas the RDBMS has already encoded the structure into the
table. Ironically, this is exactly the reason we
(Tucana/Kowari/Mulgara) ended up building an RDF database instead of
building on top of an RDBMS: The flexibility in table structure was
less efficient than a system that just "knew" it only had to deal with
3 columns. Obviously the shape of the data (among other things)
dictates which is the better type of storage to use.

A related point is that processing RDF to create an object means you
have to move around a lot in the graph. This could mean a lot of
seeking on disk, while an RDBMS will usually find the entire object in
one place on the disk. And seeks kill performance.

This leads to the operations used to build objects from an RDF store.
A single object often requires the traversal of several statements,
where the object of one statement becomes the subject of the next.
Since the tables are typically represented as
Subject/Predicate/Object, this means that the main table will be
"joined" against itself. Even RDBMSs are notorious for not doing this
efficiently.
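
To make the self-join point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are illustrative assumptions, not the benchmark's actual schema or data. It fetches a product label together with its producer's country, first from a generic triple table and then from a fixed relational layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Generic triple layout: every fact is one (s, p, o) row (hypothetical data).
cur.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
cur.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("product1", "label", "Widget"),
        ("product1", "producer", "producer1"),
        ("producer1", "label", "Acme"),
        ("producer1", "country", "DE"),
    ],
)

# Fixed relational layout holding the same facts (hypothetical schema).
cur.execute("CREATE TABLE producers (id TEXT PRIMARY KEY, label TEXT, country TEXT)")
cur.execute("CREATE TABLE products (id TEXT PRIMARY KEY, label TEXT, producer TEXT)")
cur.execute("INSERT INTO producers VALUES ('producer1', 'Acme', 'DE')")
cur.execute("INSERT INTO products VALUES ('product1', 'Widget', 'producer1')")

# Triple form: the object of one statement (t2.o) becomes the subject of
# the next (t3.s), so the same table appears three times in the query.
triple_rows = cur.execute(
    """
    SELECT t1.o, t3.o
    FROM triples t1
    JOIN triples t2 ON t2.s = t1.s AND t2.p = 'producer'
    JOIN triples t3 ON t3.s = t2.o AND t3.p = 'country'
    WHERE t1.s = 'product1' AND t1.p = 'label'
    """
).fetchall()

# Relational form: one join between two different tables.
relational_rows = cur.execute(
    """
    SELECT p.label, pr.country
    FROM products p
    JOIN producers pr ON pr.id = p.producer
    WHERE p.id = 'product1'
    """
).fetchall()

print(triple_rows)
print(relational_rows)
```

Both queries return the same answer. The difference is that the triple version touches one table three times, while the relational version reads one row from each of two tables.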

One of the problems with self-joins is that efficient operations like
merge-joins (when they can be identified) will still result in lots of
seeking, since simple iteration on both sides of the join means
seeking around in the same data. Of course, there ARE ways to optimize
some of this, but the various stores are only just starting to get to
these optimizations now.

Relational databases suffer similar problems, but joins are usually
only required for complex structures between different tables, which
can be stored on different spindles. Contrast this to RDF, which needs
to do many of these joins for all but the simplest of data.

One can question whether such flexibility is relevant in practice, and if
so, one may try to extract such structured patterns from data on-the-fly.
Still, it's important to note that we're comparing somewhat different things
here between the relational and the triple representation of the benchmark.


This is why I think it is very important to consider the type of data
being stored before choosing the type of storage to use. For some
applications an RDBMS is going to win hands down every time. For other
applications, an RDF store is definitely the way to go. Understanding
the flexibility and performance constraints of each is important. This
kind of benchmarking helps with that. It also helps identify where RDF
databases need to pick up their act.

Regards,
Paul Gearon

Paul,

You make valid points; the problem here is that the benchmark has been released without enough clarity about its prime purpose. To even compare RDF Quad Stores with an RDBMS engine when the schema is Relational in itself is kinda twisted.

The role of mappers (D2RQ & Virtuoso RDF Views), for instance, should have been made much clearer, maybe in separate results tables. I say this because these mappers offer different approaches to projecting RDBMS-based data in RDF Linked Data form, on the fly, and their purpose in this benchmark is all about raw performance and scalability as it relates to the following RDF Linked Data generation and deployment conditions:


1. Schema is Relational
2. RDF warehouse is impractical

As I am sure you know, we could invert this whole benchmark "Open World" style, and then bring RDBMS engines to their knees by incorporating SPARQL query patterns comprised of ?p's and subclasses.
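
As a rough illustration of the ?p case only (subclass reasoning is not covered), the sketch below uses Python's built-in sqlite3 module with a made-up schema and made-up rows. The helper find_value_everywhere is hypothetical. It shows how a pattern with an unknown predicate, roughly `?s ?p "product1"`, is a single lookup on a triple table but has to be expanded into one subquery per column per table on a fixed relational schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Triple layout (hypothetical data).
cur.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
cur.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("product1", "label", "Widget"),
        ("product1", "producer", "producer1"),
        ("review1", "reviewFor", "product1"),
        ("review1", "title", "Solid"),
    ],
)

# The same facts in a fixed relational layout (hypothetical schema).
cur.execute("CREATE TABLE products (id TEXT, label TEXT, producer TEXT)")
cur.execute("CREATE TABLE reviews (id TEXT, reviewFor TEXT, title TEXT)")
cur.execute("INSERT INTO products VALUES ('product1', 'Widget', 'producer1')")
cur.execute("INSERT INTO reviews VALUES ('review1', 'product1', 'Solid')")

# Triple form: the predicate is simply left unconstrained.
triple_hits = cur.execute(
    "SELECT s, p FROM triples WHERE o = ? ORDER BY s, p", ("product1",)
).fetchall()


def find_value_everywhere(cur, tables, value):
    """Hypothetical helper: emulate an unknown-predicate lookup on fixed tables.

    The predicate is a column name here, so every non-key column of every
    table gets its own subquery, glued together with UNION ALL.
    """
    parts = []
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in cur.execute(f"PRAGMA table_info({table})")]
        for column in columns:
            if column == "id":
                continue  # the key plays the role of the subject, not an object
            parts.append(f"SELECT id, '{column}' FROM {table} WHERE {column} = ?")
    sql = " UNION ALL ".join(parts) + " ORDER BY 1, 2"
    return cur.execute(sql, [value] * len(parts)).fetchall()


relational_hits = find_value_everywhere(cur, ["products", "reviews"], "product1")

print(triple_hits)
print(relational_hits)
```

With two small tables this already expands to four subqueries, and the count grows with every table and column added to the schema.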

To conclude, the quad store numbers should simply be a comparison of the quad stores themselves, and not the quad stores vs. the mappers or native SQL. This clarification really needs to make its way into the benchmark narrative.



--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen

President & CEO, OpenLink Software   Web: http://www.openlinksw.com
