On 27/02/15 17:09, Osma Suominen wrote:
27.02.2015, 18:06, Andy Seaborne wrote:
This is inefficient if there happen to be lots of skos:altLabel values,
as there are in e.g. AGROVOC thesaurus data.
How many skos:altLabel can occur in that dataset?
As an extreme example, <http://aims.fao.org/aos/agrovoc/c_1548> (the
country Chile) has 433 altLabels. The typical case (if there's such a
thing - it's probably a long tail distribution) is more like a dozen per
concept. AGROVOC has terms in over 20 languages. Queries involving the
literals tend to be a bit slow...
jena-text is a bit misnamed. It's an entity index: "find subjects such
that ...". Entity indexes make the conjunctive use cases work, "find
entities such that :property1 matches ... and :property2 matches ...".
The example above is closer to a text index (query -> literal). LARQ
could do both in different configurations (not at the same time), though
people tended to use it as a text index and then look in the RDF to make
it an entity index. It can't do the conjunctive use case in a single
call, nor is it particularly easy to manage specific properties in
different ways.
I have come to realise that we might provide both kinds of index
separately. A tightly managed literal-text-index could have deeper
integration into query processing e.g. FILTER expressions.
I don't oppose, but I don't really follow either. Is there something
fundamentally wrong with the (?s ?value) text:query 'blah' query style
that I suggested? It's not like it's unusual to store the actual values
in a Lucene index... Lucene supports it (and Solr too), LARQ does it,
many people do it. I understand that not all people will need it (and
the associated size/performance costs), but it could be made optional.
I don't know if there is anything fundamentally wrong except the lack of
conjunctive query. A conjunctive expression can be on multiple aspects
of an entity, i.e. multiple properties.
Example: storing (postal) addresses. Then search on town name and
street name in the same Lucene request.
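A minimal sketch of the conjunctive case, in Python rather than Lucene, assuming a toy in-memory entity index. The class `EntityIndex`, its methods and the sample addresses are hypothetical; they only illustrate matching several properties of one entity in a single request.

```python
from collections import defaultdict


class EntityIndex:
    """Toy entity index (hypothetical): (field, term) -> set of entity ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, entity, field, text):
        # Index every lower-cased, whitespace-separated term of the literal.
        for term in text.lower().split():
            self._postings[(field, term)].add(entity)

    def query(self, **criteria):
        # Conjunctive lookup: an entity must match every field=term pair.
        sets = [self._postings.get((field, term.lower()), set())
                for field, term in criteria.items()]
        return set.intersection(*sets) if sets else set()


# Made-up sample data: three address entities.
index = EntityIndex()
index.add("addr1", "town", "Bristol")
index.add("addr1", "street", "Park Street")
index.add("addr2", "town", "Bristol")
index.add("addr2", "street", "Queen Square")
index.add("addr3", "town", "Helsinki")
index.add("addr3", "street", "Park Avenue")

# Town and street are matched in the same request.
matches = index.query(town="bristol", street="park")
print(sorted(matches))
```

A pure text index (query -> literal) would return the matching literals for each term separately, and the intersection per entity would have to be done afterwards in the RDF.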
One possible case (not often done) is that the RDF does not hold the
literal at all, e.g. the entity is a large text document and the RDF
holds only the metadata.
As mentioned, reclaiming from the text index isn't possible in any
scheme that does not reference count the entries.
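To illustrate the reference-counting point, here is a minimal Python sketch, assuming a hypothetical index keyed only by literal text. The same literal may be attached to several triples, so an entry can only be dropped safely once the last referencing triple has gone. `RefCountedIndex` and its methods are invented for illustration and are not jena-text API.

```python
from collections import Counter


class RefCountedIndex:
    """Toy index (hypothetical) that counts how many triples use a literal."""

    def __init__(self):
        self._refs = Counter()

    def add(self, literal):
        # One more triple refers to this literal.
        self._refs[literal] += 1

    def remove(self, literal):
        # Returns True only when the entry was actually reclaimed.
        if literal not in self._refs:
            raise KeyError(literal)
        self._refs[literal] -= 1
        if self._refs[literal] == 0:
            del self._refs[literal]
            return True
        return False

    def __contains__(self, literal):
        return literal in self._refs


idx = RefCountedIndex()
idx.add("Chile")   # literal used by one triple
idx.add("Chile")   # same literal used by a second triple

first = idx.remove("Chile")    # one reference left, entry must stay
still_indexed = "Chile" in idx
second = idx.remove("Chile")   # last reference gone, entry reclaimed
gone = "Chile" not in idx
```

Without the count, the first removal could not tell whether another triple still needed the entry.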
The property function style is a generative index - it produces matches.
It can be used less efficiently
But its generality makes query planning hard. A tightly coupled index
which was only indexing literals, maybe storing them, can have stats
etc. maintained.
You can't use that style, the generative index, in a filter:
    FILTER ( text:matches(?literal, 'lucene query') )
You can use it in that fashion with
    ?s :p ?literal .
    (?x ?literal) text:query 'foo'
but the optimizer isn't going to reorder filters.
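The difference between the two styles can be sketched in Python, assuming a toy list of (subject, literal) pairs standing in for the index and the RDF data. The function names and the sample rows (including the made-up subject `c_9999`) are hypothetical and do not correspond to jena-text functions.

```python
def text_query(index, term):
    """Generative style: produces (subject, literal) matches itself."""
    for subject, literal in index:
        if term in literal.lower().split():
            yield subject, literal


def text_matches(literal, term):
    """Filter style: only tests a literal that is already bound."""
    return term in literal.lower().split()


# Made-up sample data standing in for both the index and the triples.
pairs = [
    ("c_1548", "Chile"),
    ("c_1548", "Republic of Chile"),
    ("c_9999", "Peru"),
]

# Generative: the index drives the query and binds the variables.
generated = list(text_query(pairs, "chile"))

# Filter: every (?s, ?literal) binding must exist before it can be tested.
filtered = [(s, lit) for s, lit in pairs if text_matches(lit, "chile")]
```

Both forms give the same answers here; what differs is which side has to enumerate the candidates, and that is the part a query planner would need to reason about.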
Andy
-Osma