Re: Storing values in Lucene index

Andy Seaborne Fri, 27 Feb 2015 08:12:02 -0800

On 27/02/15 06:58, Osma Suominen wrote:

On 26/02/15 18:37, Stephen Allen wrote:

I would propose in the future that we actually store and not
just index the document so that it can be appropriately identified and
deleted.  This would require a change to existing Lucene databases (we
should provide a tool to reindex existing data).  An alternative to
actually storing the value would be to generate a hash of the
subject+predicate+object and store that as an identifier.
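
A minimal sketch of the hash-identifier alternative, in Python for brevity rather than as jena-text code. The function name `triple_id`, the choice of SHA-256, and the length-prefixed encoding are all assumptions made for illustration; the proposal above only says "a hash of the subject+predicate+object".

```python
import hashlib


def triple_id(subject: str, predicate: str, obj: str) -> str:
    """Return a stable hex identifier for one triple (hypothetical helper)."""
    h = hashlib.sha256()
    for term in (subject, predicate, obj):
        data = term.encode("utf-8")
        # Length-prefix each term so that ("ab", "c") and ("a", "bc")
        # cannot produce the same byte stream.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


doc_id = triple_id(
    "http://example.org/s",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "word",
)
```

The resulting string could be kept in a stored, untokenized field of the index document, so that the document for a removed triple can be looked up and deleted by exact match.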

The same literal may be in the RDF graph multiple times. It's a reference counting problem; maintaining a reference would be expensive and limit scale.
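
To make the reference-counting concern concrete, here is a rough Python sketch. The class `RefCountedIndex` and its methods are hypothetical and are not part of jena-text; the point is only that an index entry can be dropped safely when the last occurrence goes away, which means a counter has to be kept and updated for every key.

```python
from collections import Counter


class RefCountedIndex:
    """Tracks how many times each indexed key occurs in the data (illustrative only)."""

    def __init__(self):
        self.refs = Counter()

    def add(self, key) -> bool:
        """Record one occurrence; return True if the index document must be created."""
        self.refs[key] += 1
        return self.refs[key] == 1

    def remove(self, key) -> bool:
        """Drop one occurrence; return True if the index document must be deleted."""
        if self.refs[key] == 0:
            return False  # key was never indexed, nothing to do
        self.refs[key] -= 1
        if self.refs[key] == 0:
            del self.refs[key]
            return True
        return False


index = RefCountedIndex()
key = ("ex:s", "rdfs:label", "word")
created = index.add(key)             # first occurrence, e.g. in one graph
created_again = index.add(key)       # same triple seen again, e.g. in another graph
deleted_early = index.remove(key)    # one occurrence remains
deleted_finally = index.remove(key)  # last occurrence gone
```

The counter table is extra state that has to be read and written on every update, which is where the cost and the limit on scale would come from.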

I second storing the original value in the Lucene index at least as an
option - it would obviously increase the index size, though I suspect
the increase would be rather minor if you compare it to the overall (TDB
+ text index) database size. This would be similar to how LARQ used to
work, though LARQ only provides access to the values, not the subject
resources.

Slight caveat - size of index affects the speed of Lucene so it's not just disk space compared to the size of the TDB database.

It would allow, with some additional code, having access to the actual
value from the SPARQL query. Something like this:

(?s ?value) text:query 'word' .

Then you could also easily check that the triple actually exists in
current RDF data (and in the current graph), with a pattern such as this:

?s rdfs:label ?value .


For me, it would probably allow some optimization of queries that
currently have to do a bit of detective work to find out which value
actually matched the query. I'm currently doing queries somewhat like this:

?s text:query (skos:altLabel 'word*') .
?s skos:altLabel ?value .
FILTER (STRSTARTS(?value, 'word'))

This is inefficient if there happen to be lots of skos:altLabel values,
as there are in e.g. AGROVOC thesaurus data.


How many skos:altLabel can occur in that dataset?

------

jena-text is a bit misnamed. It's an entity index: "find subjects such that ..." Entity indexes make the conjunctive use cases work, "find entities such that :property1 matches ... and :property2 matches ...".

The example above is closer to a text index (query -> literal). LARQ could do both in different configurations (not at the same time), though people tended to use it as a text index and then look in the RDF to make it an entity index. It can't do the conjunctive use case in a single call, nor is it particularly easy to manage specific properties in different ways.
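
The difference between the two kinds of index can be sketched with plain Python dictionaries. This is a toy model under stated assumptions, not how jena-text or LARQ are implemented: the function names, the whitespace tokenizer, and the sample triples are all made up for illustration.

```python
from collections import defaultdict


def tokens(text):
    """Very rough stand-in for a Lucene analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_entity_index(triples):
    """Entity index: (property, token) -> subjects that have a matching literal."""
    index = defaultdict(set)
    for s, p, o in triples:
        for tok in tokens(o):
            index[(p, tok)].add(s)
    return index


def build_literal_index(triples):
    """Literal text index: token -> literals containing that token."""
    index = defaultdict(set)
    for _s, _p, o in triples:
        for tok in tokens(o):
            index[tok].add(o)
    return index


def find_entities(entity_index, *conditions):
    """Conjunctive lookup: subjects matching every (property, token) condition."""
    matches = [entity_index.get(cond, set()) for cond in conditions]
    return set.intersection(*matches) if matches else set()


TRIPLES = [
    ("ex:a", "rdfs:label", "red apple"),
    ("ex:a", "skos:altLabel", "pome fruit"),
    ("ex:b", "rdfs:label", "green apple"),
    ("ex:b", "skos:altLabel", "cooking apple"),
]

entity_index = build_entity_index(TRIPLES)
literal_index = build_literal_index(TRIPLES)

both = find_entities(
    entity_index, ("rdfs:label", "apple"), ("skos:altLabel", "fruit")
)
literals = literal_index["apple"]
```

In this model the entity index answers the conjunctive question in one call, while the literal index returns matching values but says nothing about which subject or property they belong to.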

I have come to realise that we might provide both kinds of index separately. A tightly managed literal-text-index could have deeper integration into query processing e.g. FILTER expressions.


        Andy


-Osma
