Yes, this is exactly what I was looking for. Thanks a lot Andy.
On Sat, Dec 8, 2012 at 6:50 PM, Andy Seaborne <[email protected]> wrote:

> On 07/12/12 14:28, Laurent Pellegrino wrote:
>
>> Hello all,
>>
>> I wonder whether there exists a page that briefly summarizes how the
>> SPARQL Operations and Functions are handled by TDB. The idea is to know
>> which functions or operations use B+-tree indexes (or more generally a
>> specific data structure, a property, etc.) to be resolved "efficiently"
>> and which should be used with care on a big dataset without a previous
>> BGP filtering (i.e. there is "no other way" to improve it, or this is
>> not yet implemented, etc.).
>>
>> If there is no documentation about that, does someone know how the
>> following operators are handled internally by Jena TDB if we suppose a
>> BGP that filters nothing and a FILTER with one of the following
>> functions or operators (it should be the worst case):
>>
>> - datatype(?x) = xsd:integer, is there a kind of index for each datatype
>> associated to a quad/triple such that when this condition appears it can
>> be checked "efficiently" without comparing the values or NodeIds against
>> all the values returned, for example, after a simple BGP? Is a datatype
>> URI stored inside the NodeTable or another table?
>>
>> - STRSTARTS(STR(?x), "coucou")
>>
>> - The simple "=" operator
>>
>> - sameTerm
>>
>> Kind Regards,
>>
>> Laurent
>>
>
> There isn't a strong connection between the functions and operators and
> the TDB index design, but the high level optimizer does perform some
> optimizations that work well with the indexes by making patterns as
> grounded as possible.
>
> Equality rewrite is done in ARQ; filter placement, for TDB, is done in TDB
> (it can also happen in ARQ - there's a tension between whether to optimize
> the BGP then place filters, or place filters then optimize).
>
> The equality filter isn't very aggressive on literals because BGP matching
> is by term, whereas FILTER is by value. +0123 and 123 are the same value
> but different literals. I'm not convinced this is a good idea and maybe it
> ought to change - data loading would canonicalize literals.
>
> The other change is a RDF 1.1 thing. Simple literals go away and there is
> only xsd:string, so the "=" then works on lang-tag-less strings like BGP
> matching does.
>
> Examples:
>
> URIs are always safe to transform:
>
>   { ?s ?p ?o . FILTER ( ?o = <uri> ) }
> =>
>   { ?s ?p <uri> . BIND(<uri> AS ?o) }
>
> sameTerm on numbers
>
>   { ?s ?p ?o . FILTER ( ?o = 123 ) }
>
> is not safe to transform (?o = 00123 is a match) but
>
>   { ?s ?p ?o . FILTER ( sameTerm(?o,123) ) }
>
> is safe.
>
> sameTerm(?o,"abc"@en) is optimized
>
> (?o = "abc"@en) isn't optimized - you can call that a bug-of-omission.
> I don't see why it can't do it - it seems to treat it like
>
> (?o = "abc") which has two pattern matches. Hmm - thinking about it, it
> could treat that as a disjunction of sameTerm. Doable now ... a bit of a
> "doh" moment there.
>
>   ?o = "abc"
> ==>
>   sameTerm(?o, "abc") || sameTerm(?o, "abc"^^xsd:string)
>
> The equality optimization works with IN as well. The query is expanded
> for the disjunction, then equality rewrite happens.
>
> That's sameTerm and = covered - it's mainly a high level (algebra to
> algebra) transformation.
>
> datatype(?x) = xsd:integer
>
> There is no optimization for this. SDB stores the data in a form where it
> could be done, but it doesn't. TDB does not store the datatype separately
> - it could, but it would be a different table layout.
>
> Is it useful?
>
> STRSTARTS(STR(?x), "coucou")
>
> No specific optimization for this. There isn't a prefix index. Again,
> something that can be added, it just hasn't been. You can use LARQ for
> similar effects. It would be nice though to have an integrated prefix
> index, and even some regex acceleration (c.f. SQL's LIKE).
>
> Many optimizations can be done that aren't. There is a slight issue that
> too much optimization means that simple queries slow down as more time is
> spent optimizing than just simply doing the query. BSBM shows this.
>
> Also, it is useful to note that in TDB, certain datatypes are stored
> "inline" -- that is, the value is stored in the index itself, using 56 bits
> of the 64 bit NodeId for the encoded value. That means converting an object
> in a triple to its value for testing in ARQ is quite cheap, e.g. testing
> being in a range is quite cheap (no node table access). Parts of the BSBM
> benchmark show this up in quite extreme ways.
>
> Filter placement also happens: with a long BGP, the filter can be placed
> just after the point where all the necessary variables are defined, before
> further expansion of possibilities in later parts of the pattern.
>
> Hope that helps - is it what you are looking for?
>
> Andy
>

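
The inline storage Andy mentions (56 bits of the 64 bit NodeId holding the value) can be pictured with the following sketch. The layout here is an assumption made for illustration: a type tag in the top 8 bits and a two's complement integer in the low 56 bits. The tag constant and function names are hypothetical and are not taken from the TDB source.

```python
from typing import Optional

VALUE_BITS = 56
VALUE_MASK = (1 << VALUE_BITS) - 1
TAG_INTEGER = 0x01  # hypothetical tag value, not TDB's actual constant


def encode_inline_int(n: int) -> Optional[int]:
    """Pack n into a 64-bit id, or return None if it does not fit in 56 bits."""
    limit = 1 << (VALUE_BITS - 1)
    if not -limit <= n < limit:
        return None  # such a value would need a node table entry instead
    return (TAG_INTEGER << VALUE_BITS) | (n & VALUE_MASK)


def decode_inline_int(node_id: int) -> int:
    """Recover the integer from the id alone, with no table lookup."""
    raw = node_id & VALUE_MASK
    if raw & (1 << (VALUE_BITS - 1)):  # sign bit set: sign-extend from 56 bits
        raw -= 1 << VALUE_BITS
    return raw


def in_range(node_id: int, low: int, high: int) -> bool:
    """A range test of the kind a FILTER might apply, using only the id."""
    return low <= decode_inline_int(node_id) <= high
```

The point of the sketch is that a range test such as in_range(encode_inline_int(50), 1, 100) needs nothing beyond the id already present in the index entry, which matches the "no node table access" remark in the quoted message.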