Re: Storing values in Lucene index

Andy Seaborne Sat, 28 Feb 2015 10:10:13 -0800

On 27/02/15 17:09, Osma Suominen wrote:

27.02.2015, 18:06, Andy Seaborne wrote:

This is inefficient if there happen to be lots of skos:altLabel values,
as there are in e.g. AGROVOC thesaurus data.


How many skos:altLabel can occur in that dataset?


As an extreme example, <http://aims.fao.org/aos/agrovoc/c_1548> (the
country Chile) has 433 altLabels. The typical case (if there's such a
thing - it's probably a long tail distribution) is more like a dozen per
concept. AGROVOC has terms in over 20 languages. Queries involving the
literals tend to be a bit slow...

jena-text is a bit misnamed.  It's an entity index : "find subjects such
that ..."  Entity indexes make the conjunctive use cases work, "find
entities such that :property1 matches ... and :property2 matches ...".

The example above is closer to a text index (query -> literal). LARQ
could do both in different configurations (not at the same time), though
people tended to use it as a text index and then look in the RDF to make
it an entity index.  It can't do the conjunctive use case in a single
call, nor is it particularly easy to manage specific properties in
different ways.

I have come to realise that we might provide both kinds of index
separately.  A tightly managed literal-text-index could have deeper
integration into query processing e.g. FILTER expressions.


I don't oppose, but I don't really follow either. Is there something
fundamentally wrong with the (?s ?value) text:query 'blah' query style
that I suggested? It's not like it's unusual to store the actual values
in a Lucene index... Lucene supports it (and Solr too), LARQ does it,
many people do it. I understand that not all people will need it (and
the associated size/performance costs), but it could be made optional.

I don't know if there is anything fundamentally wrong except the lack of
conjunctive queries. A conjunctive expression can be on multiple aspects
of an entity, i.e. multiple properties.

Example: storing (postal) addresses. Then search on town name and
street name in the same Lucene request.
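
A rough sketch of the conjunctive case, using a toy in-memory index
rather than Lucene itself. The class EntityIndex, the field names
"town" and "street", and the ex:addr identifiers are all made up for
illustration; the point is only that one document per entity, with one
field per property, lets a single request intersect the matches.

```python
from collections import defaultdict


class EntityIndex:
    """Toy entity index: one entry per subject, one field per property."""

    def __init__(self):
        # (field, term) -> set of entity identifiers containing that term
        self._postings = defaultdict(set)

    def add(self, entity, field, text):
        # Index each whitespace-separated, lower-cased term of the literal.
        for term in text.lower().split():
            self._postings[(field, term)].add(entity)

    def search(self, **criteria):
        """Return entities that match every field=term pair (a conjunction)."""
        result = None
        for field, term in criteria.items():
            hits = self._postings.get((field, term.lower()), set())
            result = set(hits) if result is None else result & hits
        return result or set()


index = EntityIndex()
index.add("ex:addr1", "town", "Bristol")
index.add("ex:addr1", "street", "Park Street")
index.add("ex:addr2", "town", "Bristol")
index.add("ex:addr2", "street", "Queens Road")
index.add("ex:addr3", "town", "Bath")
index.add("ex:addr3", "street", "Park Street")

# One request, two properties: only ex:addr1 satisfies both.
matches = index.search(town="bristol", street="park")
```

A pure literal index (query -> literal) would return the matching
strings for each property separately and leave the join on the subject
to the RDF side.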

One possible case (not often done) is that the RDF does not hold the
literal at all, e.g. the entity is a large text document and the RDF
holds the metadata.

As mentioned, reclaiming from the text index isn't possible in any
scheme that does not reference count the entries.
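
To illustrate what reference counting would buy, here is a minimal
sketch. It assumes an index entry is keyed by (entity, field, text) and
that the same entry can be backed by more than one triple, for example
the same triple appearing in two graphs. RefCountedIndex is a
hypothetical name, not anything in jena-text.

```python
from collections import Counter


class RefCountedIndex:
    """Toy index that tracks how many triples back each entry."""

    def __init__(self):
        # (entity, field, text) -> number of triples that produced the entry
        self._refs = Counter()

    def add(self, entity, field, text):
        self._refs[(entity, field, text)] += 1

    def remove(self, entity, field, text):
        """Drop one reference; return True only when the entry is reclaimed."""
        key = (entity, field, text)
        if self._refs[key] == 0:
            return False  # nothing indexed under this key
        self._refs[key] -= 1
        if self._refs[key] == 0:
            del self._refs[key]  # last reference gone, safe to delete
            return True
        return False

    def __contains__(self, key):
        return key in self._refs


idx = RefCountedIndex()
entry = ("ex:c_1548", "altLabel", "Chile")
idx.add(*entry)  # triple in graph A (assumed)
idx.add(*entry)  # same triple in graph B (assumed)

first = idx.remove(*entry)   # one backing triple remains
second = idx.remove(*entry)  # last one removed, entry reclaimed
```

Without the count, the first delete has no way to know whether the
entry is still needed, so it either leaks the entry or removes it too
early.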

The property function style is a generative index - it produces matches.
It can be used less efficiently

But its generality makes query planning hard. A tightly coupled index
which was only indexing for literals, maybe storing them, can have stats
etc maintained.


You can't use that style, the generative index, in a filter:

FILTER ( text:matches(?literal, 'lucene query') )

you can use it in that fashion with

?s :p ?literal .
(?x ?literal) text:query 'foo'

but the optimizer isn't going to reorder filters.

        Andy


-Osma
