Re: At which point should I consider using text-query indexes?

Rob Vesse Tue, 23 May 2017 01:53:52 -0700

That’s a difficult question to answer because it all depends upon your data and 
what you consider an acceptable level of performance.


Generally speaking, if you find yourself matching a very general pattern and 
then filtering with a string function, you may be better served by text 
indexing, e.g.

SELECT *
WHERE
{
  ?s ?p ?label . # Scan all the data
  FILTER(STRSTARTS(?label, "foo"))
}
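
For comparison, here is a minimal sketch of a similar prefix search using 
jena-text. It assumes a text index has already been configured over 
rdfs:label (the dataset configuration is not shown, and the choice of 
rdfs:label is an assumption for illustration). The index matches on analyzed 
terms, so results may not be identical to STRSTARTS.

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label
WHERE
{
  # Ask the text index for subjects whose indexed rdfs:label matches foo*
  ?s text:query (rdfs:label "foo*") .
  # Fetch the label itself from the RDF data
  ?s rdfs:label ?label .
}
```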

However, if your query first reduces the set of data over which the filter must 
apply by matching a more specific pattern, then string functions may be fine, 
e.g.

SELECT *
WHERE
{
  ?s a <urn:some-type> ;
       <urn:some-predicate> ?value . # Find some specific subset of the data
  FILTER(STRSTARTS(?value, "foo"))
}

But it very much depends on the details, and generally it will be best to 
benchmark your specific use case on your data and then judge for yourself. If, 
as you imply, you are creating an application which hides the details of SPARQL 
from the user, you are free to adjust the underlying queries as you see fit.
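
As an illustration of adjusting the query behind a search box, here is a 
sketch that narrows by type first and then does a case-insensitive prefix 
match. It assumes titles are stored as rdfs:label, that the application 
lower-cases the user's input before substituting it for "foo", and it reuses 
the placeholder type <urn:some-type> from above.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label
WHERE
{
  ?s a <urn:some-type> ;   # Narrow to one type first
     rdfs:label ?label .
  # Case-insensitive prefix match; "foo" stands in for the lower-cased input
  FILTER(STRSTARTS(LCASE(?label), "foo"))
}
LIMIT 20
```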

Rob

On 23/05/2017 08:39, "Laura Morales" <laure...@mail.com> wrote:

    Oh, this is interesting. I thought that predicate values (rdfs:label in 
this case) were already sorted and that using STRSTARTS() would be fast because 
it could take advantage of binary search or something. I didn't expect that 
this function would have to scan all the predicate values.
    So in which scenario are SPARQL STR functions acceptable to use (in terms 
of "reasonable performance")?
    
    
    
    Laura Morales wrote on 23.05.2017 at 10:23:
    
    > Thank you for the answer. So let's say I want to search nodes in my graph 
by rdfs:label. Is this correct...
    >
    > 1) STRSTARTS(): fast by default because predicates are sorted. Only does 
exact search.
    > 2) STRSTARTS(LCASE(?label)): fast because predicates are sorted, but just 
a little bit slower than 1) because it must LCASE() some strings
    > 3) REGEX(): slow because it must go through all rdfs:labels (use 
jena-text instead)
    > 4) CONTAINS(): slow because it must go through all rdfs:labels (use 
jena-text instead)
    >
    > Is this correct?
    
    I believe all of these are roughly equivalent in terms of performance.
    All of them need to scan all the rdfs:label values. Obviously REGEX is a
    bit more expensive than e.g. STRSTARTS but the difference is not very
    big. I don't think there's any sorting of predicate values in TDB that
    would help here.
    
    > If my app has an input search box where users can search an item by title 
(on a large graph), would it be a good idea to go with 2) or should I consider 
setting up a text-query index?
    
    I recommend setting up a text index if you want to do partial matching
    of labels from a large graph.
    
    -Osma
    
    --
    Osma Suominen
    D.Sc. (Tech), Information Systems Specialist
    National Library of Finland
    P.O. Box 26 (Kaikukatu 4)
    00014 HELSINGIN YLIOPISTO
    Tel. +358 50 3199529
    osma.suomi...@helsinki.fi
    http://www.nationallibrary.fi
