Re: At which point should I consider using text-query indexes?

2017-05-23 Thread Rob Vesse
That’s a difficult question to answer because it all depends upon your data and 
what you consider an acceptable level of performance.

Generally speaking, if you find yourself matching a very general pattern and then 
filtering with a string function, you may be better served by text indexing, e.g.

SELECT *
WHERE
{
  ?s ?p ?o . # Scan all the data
  FILTER(STRSTARTS(STR(?o), "foo"))
}

However, if your query first reduces the set of data over which the filter must 
apply by doing a more specific pattern then string functions may be fine e.g.

SELECT *
WHERE
{
  ?s a  ;
?value . # Find some specific subset of the data
  FILTER(STRSTARTS(?value, "foo"))
}

But it very much depends on the details, and generally it will be best to 
benchmark your specific use case on your data and then judge for yourself. If, as 
you imply, you are creating an application which hides the details of SPARQL 
from the user, you are free to adjust the underlying queries as you see fit.
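
The difference between the two query shapes above can be pictured with a small toy model. This is only a sketch in Python, not how Jena evaluates queries: the triples, the `ex:` names and both helper functions are made up for illustration. It counts how many times the string test runs in each case.

```python
# Toy data: (subject, predicate, object). All names are hypothetical.
triples = [
    ("ex:a", "rdf:type", "ex:Book"),
    ("ex:a", "ex:title", "foo fighters"),
    ("ex:b", "rdf:type", "ex:Person"),
    ("ex:b", "ex:name", "foo bar"),
    ("ex:c", "rdf:type", "ex:Book"),
    ("ex:c", "ex:title", "bar none"),
]


def scan_all(prefix):
    """Like `?s ?p ?o` plus FILTER: the string test runs once per triple."""
    tests = 0
    hits = []
    for s, p, o in triples:
        tests += 1
        if o.startswith(prefix):
            hits.append(s)
    return hits, tests


def scan_titles_of_books(prefix):
    """Like `?s a ex:Book ; ex:title ?value` plus FILTER: narrow first.

    In this toy the narrowing is itself a loop over the list; a real store
    would resolve the specific pattern through its triple indexes.
    """
    books = {s for s, p, o in triples if p == "rdf:type" and o == "ex:Book"}
    tests = 0
    hits = []
    for s, p, o in triples:
        if p == "ex:title" and s in books:
            tests += 1  # the string test only runs on the narrowed subset
            if o.startswith(prefix):
                hits.append(s)
    return hits, tests


print(scan_all("foo"))              # string test evaluated for every triple
print(scan_titles_of_books("foo"))  # string test evaluated for book titles only
```

With real data the gap depends on how selective the specific pattern is, which is why benchmarking on your own data is the only reliable guide.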

Rob

Re: At which point should I consider using text-query indexes?

2017-05-23 Thread Laura Morales
Oh, this is interesting. I thought that predicates values (rdfs:label in this 
case) were already sorted and that using STRSTARTS() would be fast because it 
could take advantage of binary search or something. I didn't expect that this 
function would have to scan all the predicate values.
So in which scenario are sparql STR functions acceptable to use (in terms of 
"reasonable performance")?




Re: At which point should I consider using text-query indexes?

2017-05-23 Thread Osma Suominen

Hi Laura!

Laura Morales wrote on 23.05.2017 at 10:23:

> Thank you for the answer. So let's say I want to search nodes in my graph by 
> rdfs:label. Is this correct...
>
> 1) STRSTARTS(): fast by default because predicates are sorted. Only does exact 
> search.
> 2) STRSTARTS(LCASE(?label)): fast because predicates are sorted, but just a 
> little bit slower than 1) because it must LCASE() some strings
> 3) REGEX(): slow because it must go through all rdfs:labels (use jena-text 
> instead)
> 4) CONTAINS(): slow because it must go through all rdfs:labels (use jena-text 
> instead)
>
> Is this correct?


I believe all of these are roughly equivalent in terms of performance. 
All of them need to scan all the rdfs:label values. Obviously REGEX is a 
bit more expensive than e.g. STRSTARTS but the difference is not very 
big. I don't think there's any sorting of predicate values in TDB that 
would help here.
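
To make the distinction concrete, here is a toy Python sketch (not a description of TDB internals; the label list and function names are invented for illustration). It contrasts what a FILTER has to do, testing every value, with what a sorted index over the values would allow if one existed.

```python
import bisect

# Hypothetical label values, in no particular order.
labels = ["food", "zebra", "Foobar", "foo", "bar", "fool"]


def prefix_scan(values, prefix):
    """What a FILTER does: test every value, one string comparison each."""
    return [v for v in values if v.startswith(prefix)]


def prefix_sorted(sorted_values, prefix):
    """What a sorted index would allow: binary-search to the first candidate,
    then walk forward only while the prefix still matches."""
    i = bisect.bisect_left(sorted_values, prefix)
    out = []
    while i < len(sorted_values) and sorted_values[i].startswith(prefix):
        out.append(sorted_values[i])
        i += 1
    return out


sorted_labels = sorted(labels)
# A case-insensitive prefix search needs its own index over the lowercased
# values; the ordering of the raw values does not help with LCASE().
lower_sorted = sorted(v.lower() for v in labels)

print(prefix_scan(labels, "foo"))
print(prefix_sorted(sorted_labels, "foo"))
print(prefix_sorted(lower_sorted, "foo"))
```

A text index plays the role of the second function: it is a separate structure built specifically so that this kind of lookup does not have to touch every label.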



> If my app has an input search box where users can search an item by title (on a 
> large graph), would it be a good idea to go with 2) or should I consider 
> setting up a text-query index?


I recommend setting up a text index if you want to do partial matching 
of labels from a large graph.


-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi


Re: At which point should I consider using text-query indexes?

2017-05-23 Thread Laura Morales
> Hi Laura!
> 
> The string functions are standard SPARQL features and don't rely on a
> text index.
> 
> The text index is only useful if you need either full text search or
> other efficient non-exact types of text matching such as prefix searches.
> 
> You can always use FILTERs and string functions, but they can be slow
> when you have large amounts of data as they need to be evaluated for
> every string in the data.
> 
> -Osma


Thank you for the answer. So let's say I want to search nodes in my graph by 
rdfs:label. Is this correct...

1) STRSTARTS(): fast by default because predicates are sorted. Only does exact 
search.
2) STRSTARTS(LCASE(?label)): fast because predicates are sorted, but just a 
little bit slower than 1) because it must LCASE() some strings
3) REGEX(): slow because it must go through all rdfs:labels (use jena-text 
instead)
4) CONTAINS(): slow because it must go through all rdfs:labels (use jena-text 
instead)

Is this correct?

If my app has an input search box where users can search an item by title (on a 
large graph), would it be a good idea to go with 2) or should I consider 
setting up a text-query index?


Re: At which point should I consider using text-query indexes?

2017-05-23 Thread Lorenz B.
Without a fulltext index, SPARQL string matching/containment operations
might need a full scan of the data, whereas a fulltext index uses
data structures such as inverted lists that make such lookups efficient.
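
As a rough picture of the kind of structure a fulltext index relies on, here is a minimal inverted index in Python. It is only a sketch: the subjects, labels and function names are invented, and Lucene's real index does far more (analysis, scoring, prefix and fuzzy matching).

```python
from collections import defaultdict

# Hypothetical subject -> label data.
labels = {
    "ex:a": "The quick brown fox",
    "ex:b": "Quick start guide",
    "ex:c": "Brown bear",
}


def build_index(docs):
    """Map each lowercased word to the set of subjects whose label contains it."""
    index = defaultdict(set)
    for subject, text in docs.items():
        for token in text.lower().split():
            index[token].add(subject)
    return index


def search(index, word):
    """Look the word up directly instead of scanning every label."""
    return sorted(index.get(word.lower(), set()))


index = build_index(labels)
print(search(index, "quick"))
print(search(index, "brown"))
print(search(index, "cat"))
```

The cost of a lookup no longer grows with the number of labels, at the price of building and maintaining the extra index.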

> Yes I understand this, my problem is that I don't understand why/when would I 
> want to use this and when standard STR functions are OK instead.
>
>
>> The fulltext index via Lucene is only used if you use the corresponding
>> special predicate (which is non-standard SPARQL syntax) in your query:
>>
>> text:query
>>
>> e.g. in a query like (taken from the documentation):
>>
>> PREFIX text: 
>> PREFIX rdfs: 
>> 
>>
>> SELECT ?s
>> { ?s text:query (rdfs:label 'word' 10) ;
>> rdfs:label ?label
>> }

-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center



Re: At which point should I consider using text-query indexes?

2017-05-23 Thread Laura Morales
Yes I understand this, my problem is that I don't understand why/when would I 
want to use this and when standard STR functions are OK instead.


> The fulltext index via Lucene is only used if you use the corresponding
> special predicate (which is non-standard SPARQL syntax) in your query:
> 
> text:query
> 
> e.g. in a query like (taken from the documentation):
> 
> PREFIX text: 
> PREFIX rdfs: 
> 
> 
> SELECT ?s
> { ?s text:query (rdfs:label 'word' 10) ;
> rdfs:label ?label
> }


Re: At which point should I consider using text-query indexes?

2017-05-23 Thread Osma Suominen

Hi Laura!

The string functions are standard SPARQL features and don't rely on a 
text index.


The text index is only useful if you need either full text search or 
other efficient non-exact types of text matching such as prefix searches.


You can always use FILTERs and string functions, but they can be slow 
when you have large amounts of data as they need to be evaluated for 
every string in the data.


-Osma

Laura Morales wrote on 23.05.2017 at 09:57:

> I'm reading the documentation at 
> https://jena.apache.org/documentation/query/text-query.html but I don't 
> understand if this is only for full-text searches. Or should I use one of these 
> indexes every time I use one of the string functions 
> (https://www.w3.org/TR/sparql11-query/#func-strings) such as CONTAINS, LCASE, 
> etc.?




--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi


Re: At which point should I consider using text-query indexes?

2017-05-23 Thread Lorenz B.
The fulltext index via Lucene is only used if you use the corresponding
special predicate (which is non-standard SPARQL syntax) in your query:

text:query

e.g. in a query like (taken from the documentation):

PREFIX text: 
PREFIX rdfs: 

SELECT ?s
{ ?s text:query (rdfs:label 'word' 10) ;
 rdfs:label ?label
}


> I'm reading the documentation at 
> https://jena.apache.org/documentation/query/text-query.html but I don't 
> understand if this is only for full-text searches. Or should I use one of 
> these indexes every time I use one of the string functions 
> (https://www.w3.org/TR/sparql11-query/#func-strings) such as CONTAINS, LCASE, 
> etc.?
>
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center