On 26/02/15 18:37, Stephen Allen wrote:
I would propose in the future that we actually store and not
just index the document so that it can be appropriately identified and
deleted.  This would require a change to existing Lucene databases (we
should provide a tool to reindex existing data).  An alternative to
actually storing the value would be to generate a hash of the
subject+predicate+object and store that as an identifier.

I second storing the original value in the Lucene index, at least as an option. It would obviously increase the index size, though I suspect the increase would be rather minor compared to the overall (TDB + text index) database size. This would be similar to how LARQ used to work, though LARQ only provided access to the values, not the subject resources.

With some additional code, it would also allow accessing the actual matched value from the SPARQL query. Something like this:

(?s ?value) text:query 'word' .

Then you could also easily check that the triple actually exists in the current RDF data (and in the current graph), with a pattern such as this:

?s rdfs:label ?value .
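
Put together, a query using such an extension could look roughly like this (only a sketch; the (?s ?value) subject list is the proposed syntax, not something jena-text supports today):

PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?value WHERE {
  # proposed: bind the matched literal to ?value alongside the subject
  (?s ?value) text:query 'word' .
  # confirm the matched triple still exists in the current data
  ?s rdfs:label ?value .
}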


For me, it would probably allow some optimization of queries that currently have to do a bit of detective work to find out which value actually matched the query. I'm currently doing queries somewhat like this:

?s text:query (skos:altLabel 'word*') .
?s skos:altLabel ?value .
FILTER (STRSTARTS(?value, 'word'))

This is inefficient when there happen to be lots of skos:altLabel values per subject, as there are in, for example, the AGROVOC thesaurus data.
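
With the matched value available directly, the same lookup could drop the STRSTARTS filter entirely. Again only a sketch, assuming the proposed (?s ?value) syntax:

PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?s ?value WHERE {
  # proposed: ?value is bound to the literal that matched 'word*'
  (?s ?value) text:query (skos:altLabel 'word*') .
  # keep the check that the triple exists in the current data
  ?s skos:altLabel ?value .
}

That would avoid scanning every skos:altLabel of each matching concept just to recover the value that actually matched.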

-Osma


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi
