Hi Wolfgang!

I have the same problem - I want to do fast prefix searches using SPARQL - and I've considered ways to implement it. Currently I'm just using a prefix query like "head*" in the text:query and then filtering the hits with a regex. (Thanks to Joshua for the strstarts trick, I'll have to test how much faster it is!)

The problem is that the StandardAnalyzer used by the jena-text Lucene index tokenizes the original literals into words, and Lucene then matches a word regardless of its position in the literal. So you get matches even when the word (or prefix in this case) is not the first one in the literal, as in "DICOM Header Tag".

One way around this would be to tweak the jena-text Lucene implementation so that it doesn't tokenize the strings. Then I think you would get only real prefix matches. I haven't tried this with jena-text, but I've done similar things with plain Lucene in the past.
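
To make the difference concrete, here is a toy Python model of the two indexing strategies. It is only a sketch, not Lucene or jena-text code: the function names are made up for the illustration, the tokenizer is a crude stand-in for StandardAnalyzer, and lowercasing the untokenized literal is my assumption about how such an index would be set up.

```python
import re

def tokenize(literal):
    # Crude stand-in for a word tokenizer: lowercase the literal and
    # split it on anything that is not a letter or digit.
    return [t for t in re.split(r"[^0-9a-z]+", literal.lower()) if t]

def prefix_hits_tokenized(literals, prefix):
    # Tokenized index: a literal is a hit if ANY of its words
    # starts with the prefix, wherever that word occurs.
    p = prefix.lower()
    return [lit for lit in literals
            if any(tok.startswith(p) for tok in tokenize(lit))]

def prefix_hits_untokenized(literals, prefix):
    # Untokenized index: the whole literal is a single term, so only
    # a literal that itself starts with the prefix is a hit.
    # (Lowercasing here is an assumption made for the illustration.)
    p = prefix.lower()
    return [lit for lit in literals if lit.lower().startswith(p)]

LITERALS = ["Head Carcinoma", "Head Injury", "DICOM Header Tag"]

tokenized = prefix_hits_tokenized(LITERALS, "head")
untokenized = prefix_hits_untokenized(LITERALS, "head")

print("tokenized:  ", tokenized)
print("untokenized:", untokenized)
```

In this model the tokenized lookup returns all three literals, including "DICOM Header Tag", while the untokenized lookup returns only the two that really start with "Head".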

Currently jena-text is hardwired to use StandardAnalyzer with the default settings; you can't use anything else without altering the code. This was also a problem with LARQ, and I've discussed it in the past on this list:
http://mail-archives.apache.org/mod_mbox/jena-users/201209.mbox/%3c50448b34.6050...@aalto.fi%3E

Another option would be to switch to using jena-text with Solr. This requires a bit more setting up as you have to run the Solr server daemon as well. But in Solr you can configure how the indexing is done using the schema.xml file, so you could easily ask it not to tokenize strings. I haven't tried this yet either, but it might be an option for you.

-Osma

On 14/11/13 17:40, huey...@aol.com wrote:
Hi Andy,

I tried "Head*" but it does not work like "starts-with".

"Head*" matches "DICOM Header Tag", which just "Head" does not. So that behaves 
as expected.

But it still does not solve my "starts-with" problem since "DICOM Header Tag" was returned as part 
of the results in the first place. I only want matches like "Head Carcinoma", "Head Injury" etc.

I checked out the two links you sent before posting this question. The tutorial 
mentions starts-with using the asterisk, but it matches any word in the text 
that starts-with the search string which is not what I am looking for.

How do I tell the text query that it should only look for matches at the start of the 
string? (like "^" in regex or strstarts).

-Wolfgang

-----Original Message-----

From: Andy Seaborne <a...@apache.org>
To: users <users@jena.apache.org>
Sent: Thu, Nov 14, 2013 3:44 pm
Subject: Re: Jena-text starts-wth


On 14/11/13 14:04, Joshua TAYLOR wrote:
On Thu, Nov 14, 2013 at 7:42 AM,  <huey...@aol.com> wrote:

I am using the following query to get all concepts that start with the word
"Head".


PREFIX text: <http://jena.apache.org/text#>
PREFIX nci: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT *
WHERE {
?s text:query (nci:Preferred_Name 'Head') .
?s nci:Preferred_Name ?prefName .
FILTER ( regex(?prefName, "^Head", "" ))
}


Is there a way of doing that in the text query itself without having to add a
FILTER?

Maybe the Jena Lucene combination can do something without a FILTER,
but I don't know much about that, and can't help you out there.  I
would point out, though, that you can make this FILTER less expensive
by using SPARQL 1.1's STRSTARTS:

      filter( strstarts( str(?prefName), "Head" ))





You can use the full Lucene query syntax:

     ?s text:query (nci:Preferred_Name 'Head*') .

http://www.lucenetutorial.com/lucene-query-syntax.html
http://lucene.apache.org/core/4_3_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html

on the default field.

        Andy








--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi
