RE: 'standardizing' on one or more predicates for text search in SPARQL?

Seaborne, Andy Sun, 17 Aug 2008 09:20:25 -0700


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:public-sparql-dev-
> [EMAIL PROTECTED] On Behalf Of Lee Feigenbaum
> Sent: 16 August 2008 21:50
> To: [email protected]
> Subject: 'standardizing' on one or more predicates for text search in
> SPARQL?
>
>
> Many SPARQL engines contain support for a magic/computed/functional
> predicate that can be used to relate a literal subject (?o if you will)
> to a text search string.
>
> See http://esw.w3.org/topic/SPARQL/Extensions/Computed_Properties for
> links to some examples.
>
> Right now, different implementations use different predicates. As far as
> I can tell:
>
> ARQ (Jena): http://jena.hpl.hp.com/ARQ/property#
> Virtuoso: bif:contains  (though I can't tell what prefix bif:
> corresponds to)
> Glitter (Open Anzo): http://openanzo.org/predicates/textmatch
> AllegroGraph: http://franz.com/ns/allegrograph/2.2/textindex/match


ARQ didn't invent property functions.  cwm came first: it has a predicate 
mechanism which it uses (amongst other things) for regexes, with 
"string:matches".

> A couple of questions:
>
> 1) What is the search syntax of these predicates? For example, the
> object of Glitter's textmatch is a Lucene search string. I think (but am
> not sure) that ARQ is the same, and I'm not sure about the others.

ARQ documentation: http://jena.sourceforge.net/ARQ/lucene-arq.html
ARQ uses Lucene for all the real work, including the free text index.  The 
search syntax is the Lucene query language (AND, OR, proximity, fuzzy match).

http://lucene.apache.org/java/2_3_2/queryparsersyntax.html

The simple form is:
    ?lit pf:textMatch '+text' .

The search string is Lucene syntax (and is passed unchanged to Lucene).

This form can also be used to find documents whose content matches the 
search:
    ?uri pf:textMatch '+text' .

because it is not fixed that the Lucene index contains the associated literal.  
It could instead map the text to the URI of the document causing the match.

A constant (or already bound) subject simply requires an exact match to the 
index value.  So the index can be used as a restrictive or generative index.

The most complex form uses RDF lists for arguments:
  # Limit to scores of 0.5 and limit to 100 hits (object slot)
  # Return the literals matched and the score (subject slot)
  (?lit ?score ) pf:textMatch ( '+text' 0.5 100 ) .

For free text, the subject slot holds the outputs and the object slot the 
inputs.  Inputs can be variables but must already be bound by the time a call 
is made.  Fixed outputs are matched for equality.  Because not all functions, 
in practical terms, work both ways, there have to be additional rules for 
evaluation of property functions as BGPs.
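
To make those rules concrete, here is a minimal Python sketch.  It is not ARQ's 
code: the Var class, the text_match function and the toy index with its scores 
are all invented for illustration, and a plain substring test stands in for 
Lucene.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A query variable such as ?lit (hypothetical helper)."""
    name: str


# Toy stand-in for a free text index: literal -> relevance score.
# A real engine would ask Lucene; these values are made up.
TOY_INDEX = {"free text search": 0.9, "plain text": 0.6, "textual": 0.3}


def text_match(subject, search, binding, min_score=0.0, limit=None):
    """Evaluate `subject textMatch search` against one incoming binding.

    Returns a list of extended bindings (dicts of variable name -> value).
    """
    # Object slot = input: a variable is allowed, but must already be bound.
    if isinstance(search, Var):
        if search.name not in binding:
            raise ValueError(f"input ?{search.name} is not bound")
        search = binding[search.name]

    # Substring test in place of a real Lucene query, best score first.
    hits = [lit for lit, score in TOY_INDEX.items()
            if search in lit and score >= min_score]
    hits.sort(key=lambda lit: TOY_INDEX[lit], reverse=True)
    if limit is not None:
        hits = hits[:limit]

    # Subject slot = output.
    if isinstance(subject, Var) and subject.name not in binding:
        # Generative use: one result per hit.
        return [{**binding, subject.name: lit} for lit in hits]
    # Restrictive use: a constant or bound subject must equal an index value.
    value = binding[subject.name] if isinstance(subject, Var) else subject
    return [dict(binding)] if value in hits else []
```

Calling text_match(Var("lit"), "text", {}, min_score=0.5) generates one binding 
per hit, whereas text_match("plain text", "text", {}) only checks that the 
constant is among the hits.  An unbound variable in the object slot is rejected, 
which is the "doesn't work both ways" case.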

> 2) Do we have any hope of reconciling these to promote more
> interoperable queries of this sort? At the least, are implementors
> willing to support all 4 of these predicates (and perhaps others)
> interchangeably?

Yes, although I'd rather implement one commonly agreed form than 4 (slightly 
different) forms.

>
> 3) Is there any value in coining an "implementation-independent" URI for
> textsearch and adding that to existing implementations?

Yes - it would be valuable to have a common form that covers the basic cases 
and is independent of implementation technology.

It's probably more valuable to cover only the simple(r) case, to increase the 
number of implementations.  So more complex argument forms, and more complex 
text searches, shouldn't be mandatory.

Ditto the text matching language.  A widely available core form is more useful 
than a complex language that is less widely provided.

> 4) Do existing implementations compile simple invocations of the SPARQL
> regex filter function into uses of text-search indexes? Is regex(...)
> the best way to interoperably _and_ efficiently perform SPARQL text
> match queries? (This has come to light in the recent Berlin benchmark
> SPARQL queries.)

ARQ does not equate the two.  A regex is a yes/no exact match; free text is a 
best match (hence scores and needing to limit the number of hits).  That makes 
defining the right answers for an agreed predicate rather hard - there are 
tradeoffs in the free text engine to be made.

There are lots of things that can be done to speed up regex, and we have a 
standard regex language from XSD, but it's not free text searching.
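
A small Python sketch of the difference, under loud assumptions: the documents 
are made up, and the "score" is just the fraction of query words present in a 
document, which is nothing like Lucene's real ranking.  The names regex_filter 
and toy_free_text are hypothetical.

```python
import re

DOCS = ["text search in SPARQL", "searching free text", "regex filter"]


def regex_filter(pattern, docs):
    """Yes/no: every document either matches the pattern or it does not."""
    return [d for d in docs if re.search(pattern, d)]


def toy_free_text(query, docs, min_score=0.0, limit=None):
    """Best match: rank documents by a made-up score, then cut off."""
    terms = query.lower().split()
    scored = []
    for d in docs:
        words = set(d.lower().split())
        # Fraction of query words found in the document (toy scoring only).
        score = sum(t in words for t in terms) / len(terms)
        if score > 0 and score >= min_score:
            scored.append((d, score))
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if limit is None else scored[:limit]
```

The regex has a single right answer for a given pattern.  The free text result 
changes with the scoring function, the score threshold and the hit limit, which 
is why the right answers for an agreed predicate are hard to pin down.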

>  From my point of view as an implementor, I'd be happy to support other
> predicates and/or an agreed upon implementation-neutral predicate in
> Glitter, though I'd want to be clear on the syntax of the search string
> itself. Glitter doesn't currently compile regex(...) into
> anzo:textmatch, but I've been intending to add that support in the light
> of the Berlin query benchmark suite.
>
> Lee

I'd like to see something that can be widely supported.  Let's also recognize 
that there is a cost - a split between "large" and "small" implementations 
would be a bad thing.

Hope that helps,

        Andy
