On 27/02/15 06:58, Osma Suominen wrote:
On 26/02/15 18:37, Stephen Allen wrote:
I would propose in the future that we actually store and not
just index the document so that it can be appropriately identified and
deleted.  This would require a change to existing Lucene databases (we
should provide a tool to reindex existing data).  An alternative to
actually storing the value would be to generate a hash of the
subject+predicate+object and store that as an identifier.
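
As a rough illustration of the hashing alternative, the sketch below derives a stable identifier from the three terms of a triple. It is not jena-text code: the function name triple_id, the choice of SHA-256 and the length-prefixing of each term are all assumptions made for the example.

```python
import hashlib


def triple_id(subject: str, predicate: str, obj: str) -> str:
    """Return a stable hex identifier for one (subject, predicate, object) triple.

    Hypothetical helper for illustration only; not part of jena-text.
    """
    h = hashlib.sha256()
    for term in (subject, predicate, obj):
        data = term.encode("utf-8")
        # Length-prefix each term so that ("ab", "c") and ("a", "bc")
        # cannot produce the same byte stream.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()
```

Such an identifier could be stored as an untokenized field, so a delete only needs to recompute the hash from the triple being removed rather than look up the stored literal.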

The same literal may be in the RDF graph multiple times. It's a reference counting problem; maintaining a reference would be expensive and limit scale.
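
To make the reference counting concern concrete, here is a minimal in-memory sketch of what counted index entries would involve. The class RefCountedIndex and its methods are hypothetical and only model the bookkeeping; a real Lucene-backed version would have to persist and update a count on every add and delete, which is where the cost comes from.

```python
from collections import Counter


class RefCountedIndex:
    """Toy index where an entry is dropped only when its last reference goes.

    Hypothetical model for illustration; not how jena-text works.
    """

    def __init__(self):
        self.refs = Counter()  # key -> number of triples currently using it
        self.docs = {}         # key -> indexed text

    def add(self, key, text):
        # Every added triple costs a read-modify-write of the count.
        self.refs[key] += 1
        self.docs[key] = text

    def remove(self, key):
        if self.refs[key] == 0:
            return  # unknown key, nothing to do
        self.refs[key] -= 1
        if self.refs[key] == 0:
            # Last reference gone: only now is it safe to delete the entry.
            del self.refs[key]
            del self.docs[key]
```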

I second storing the original value in the Lucene index at least as an
option - it would obviously increase the index size, though I suspect
the increase would be rather minor if you compare it to the overall (TDB
+ text index) database size. This would be similar to how LARQ used to
work, though LARQ only provides access to the values, not the subject
resources.

Slight caveat - size of index affects the speed of Lucene so it's not just disk space compared to the size of the TDB database.

It would allow, with some additional code, having access to the actual
value from the SPARQL query. Something like this:

(?s ?value) text:query 'word' .

Then you could also easily check that the triple actually exists in
current RDF data (and in the current graph), with a pattern such as this:

?s rdfs:label ?value .


For me, it would probably allow some optimization of queries that
currently have to do a bit of detective work to find out which value
actually matched the query. I'm currently doing queries somewhat like this:

?s text:query (skos:altLabel 'word*') .
?s skos:altLabel ?value .
FILTER (STRSTARTS(?value, 'word'))

This is inefficient if there happen to be lots of skos:altLabel values,
as there are in e.g. AGROVOC thesaurus data.

How many skos:altLabel can occur in that dataset?

------
jena-text is a bit misnamed. It's an entity index: "find subjects such that ...". Entity indexes make the conjunctive use cases work, "find entities such that :property1 matches ... and :property2 matches ...".
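
The conjunctive use case can be sketched with a toy model in which each subject is one document with a field per property. Everything here is an assumption for illustration: the function find_entities, the whitespace tokenization and the sample data are invented and do not reflect the actual jena-text or Lucene APIs.

```python
def find_entities(index, criteria):
    """Return subjects whose fields contain every requested word.

    index:    {subject: {property: text}}  (one "document" per entity)
    criteria: {property: word}             (all must match: conjunctive)
    Hypothetical helper; matching is a naive whitespace-token comparison.
    """
    matches = []
    for subject, fields in index.items():
        if all(
            word.lower() in fields.get(prop, "").lower().split()
            for prop, word in criteria.items()
        ):
            matches.append(subject)
    return sorted(matches)


# Invented sample data for the example.
entities = {
    "ex:e1": {"rdfs:label": "red apple", "rdfs:comment": "a sweet fruit"},
    "ex:e2": {"rdfs:label": "apple computer", "rdfs:comment": "a machine"},
    "ex:e3": {"rdfs:label": "pear", "rdfs:comment": "a sweet fruit"},
}
```

With this data, find_entities(entities, {"rdfs:label": "apple", "rdfs:comment": "fruit"}) selects only ex:e1. The result is a subject, not the literal that matched, which is the distinction from a text index.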

The example above is closer to a text index (query -> literal). LARQ could do both in different configurations (not at the same time), though people tended to use it as a text index and then look in the RDF to make it an entity index. It can't do the conjunctive use case in a single call, nor is it particularly easy to manage specific properties in different ways.

I have come to realise that we might provide both kinds of index separately. A tightly managed literal-text-index could have deeper integration into query processing e.g. FILTER expressions.

        Andy




-Osma


