Re: Jena-text starts-wth

Osma Suominen Sun, 17 Nov 2013 22:47:45 -0800

Hi Wolfgang!

I have the same problem - I want to do fast prefix searches using SPARQL - and I've considered ways to implement it. Currently I'm just using a prefix query like "head*" in the text:query, and then filtering the hits using a regex. (Thanks to Joshua for the strstarts trick, I'll have to try how much faster it is!)

The problem is that the StandardAnalyzer used by the jena-text Lucene index tokenizes the original literals into words, and Lucene then performs matches regardless of the position the word was in. So you will get matches even when the word (or prefix in this case) is not the first one in the literal, as in "DICOM Header Tag".
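
To make the tokenization issue concrete, here is a minimal Python sketch. It is not Lucene or jena-text code: `tokenize` is only a rough stand-in for what StandardAnalyzer does (lowercasing and splitting into words), and all function names are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters.
    A rough approximation of a word tokenizer, not the real StandardAnalyzer."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def token_prefix_match(literal, prefix):
    """True if ANY token of the literal starts with the prefix,
    which mimics how a prefix query behaves on a tokenized field."""
    p = prefix.lower()
    return any(tok.startswith(p) for tok in tokenize(literal))

literals = ["Head Injury", "DICOM Header Tag", "Head Carcinoma", "Forehead"]
hits = [lit for lit in literals if token_prefix_match(lit, "head")]
print(hits)
```

In this sketch "DICOM Header Tag" is a hit because its second token starts with "head", while "Forehead" is not, since the prefix has to match from the start of a token.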

One way around this would be to tweak the jena-text Lucene implementation so that it doesn't tokenize the strings. Then I think you would get only real prefix matches. I haven't tried this with jena-text, but I've done similar things with plain Lucene in the past.
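
As a rough illustration of why an untokenized index gives true prefix matches, here is a small Python sketch. It assumes each whole literal is stored as a single lowercased key in a sorted list; that is only an analogy for an untokenized field, not how jena-text or Lucene is actually implemented, and the function names are hypothetical.

```python
from bisect import bisect_left

def build_keyword_index(literals):
    """Store each whole literal as ONE lowercased key (no tokenization),
    kept in sorted order so that prefix lookups are a range scan."""
    return sorted((lit.lower(), lit) for lit in literals)

def prefix_search(index, prefix):
    """Return the original literals whose full text starts with the prefix."""
    p = prefix.lower()
    # Jump to the first key that is >= the prefix.
    start = bisect_left(index, (p,))
    results = []
    for key, original in index[start:]:
        if not key.startswith(p):
            break  # keys are sorted, so no later key can match either
        results.append(original)
    return results

index = build_keyword_index(
    ["Head Injury", "DICOM Header Tag", "Head Carcinoma", "Forehead"]
)
print(prefix_search(index, "Head"))
```

Because the key is the entire string, "DICOM Header Tag" sorts under "d" and never enters the scanned range, so only literals that really start with "Head" come back.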

Currently jena-text is hardwired to use StandardAnalyzer with the default settings; you can't use anything else without altering the code. This was also a problem with LARQ and I've discussed it in the past on this list:

http://mail-archives.apache.org/mod_mbox/jena-users/201209.mbox/%3c50448b34.6050...@aalto.fi%3E

Another option would be to switch to using jena-text with Solr. This requires a bit more setting up as you have to run the Solr server daemon as well. But in Solr you can configure how the indexing is done using the schema.xml file, so you could easily ask it not to tokenize strings. I haven't tried this yet either, but it might be an option for you.


-Osma

On 14/11/13 17:40, huey...@aol.com wrote:

Hi Andy,

I tried "Head*" but it does not work like "starts-with".

"Head*" matches "DICOM Header Tag", which just "Head" does not. So that behaves 
as expected.

But it still does not solve my "starts-with" problem since "DICOM Header Tag" was returned as part 
of the results in the first place. I only want matches like "Head Carcinoma", "Head Injury" etc.

I checked out the two links you sent before posting this question. The tutorial
mentions starts-with using the asterisk, but it matches any word in the text
that starts with the search string, which is not what I am looking for.

How do I tell the text query that it should only look for matches at the start of the 
string? (like "^" in regex or strstarts).

-Wolfgang

-----Original Message-----

From: Andy Seaborne <a...@apache.org>
To: users <users@jena.apache.org>
Sent: Thu, Nov 14, 2013 3:44 pm
Subject: Re: Jena-text starts-wth


On 14/11/13 14:04, Joshua TAYLOR wrote:

On Thu, Nov 14, 2013 at 7:42 AM,  <huey...@aol.com> wrote:


I am using the following query to get all concepts that start with the word "Head".



PREFIX text: <http://jena.apache.org/text#>
PREFIX nci: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT *
WHERE {
?s text:query (nci:Preferred_Name 'Head') .
?s nci:Preferred_Name ?prefName .
FILTER ( regex(?prefName, "^Head", "" ))
}


Is there a way of doing that in the text query itself without having to add a FILTER?


Maybe the Jena Lucene combination can do something without a FILTER,
but I don't know much about that, and can't help you out there.  I
would point out, though, that you can make this FILTER less expensive
by using SPARQL 1.1's STRSTARTS:

      filter( strstarts( str(?prefName), "Head" ))


You can use the full Lucene query syntax:

     ?s text:query (nci:Preferred_Name 'Head*') .

http://www.lucenetutorial.com/lucene-query-syntax.html
http://lucene.apache.org/core/4_3_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html

on the default field.

        Andy



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi
