Re: Jena Full Text Search poor performance

rve...@dotnetrdf.org Thu, 16 Jun 2022 02:25:57 -0700

I think the problem is your text:query search string. You have ‘*Allergy*’,
which has both a leading and a trailing wildcard, and the leading wildcard is
extremely expensive for Lucene to evaluate.  With a leading wildcard, Lucene
must check every unique term in your index against the search term to work out
which terms match, and thus which documents to return.  A trailing wildcard is
less costly, but Lucene may still have to consider a large swathe of candidate
terms.
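
To illustrate the cost difference, here is a deliberately simplified Python
model of a sorted term dictionary. It is not Lucene's actual implementation;
the term list and the function names are made up for illustration. It only
shows why a trailing wildcard can skip most terms while a leading wildcard
cannot.

```python
from bisect import bisect_left


def match_trailing_wildcard(sorted_terms, prefix):
    """Model of 'prefix*': binary-search to the first candidate, then walk
    forward until a term no longer starts with the prefix."""
    index = bisect_left(sorted_terms, prefix)
    matches = []
    examined = 0
    while index < len(sorted_terms):
        examined += 1
        term = sorted_terms[index]
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
        index += 1
    return matches, examined


def match_leading_wildcard(sorted_terms, infix):
    """Model of '*infix*': sort order gives no shortcut, so every unique
    term has to be tested."""
    matches = [term for term in sorted_terms if infix in term]
    return matches, len(sorted_terms)


# Hypothetical, lower-cased index terms (assumption: not taken from real data).
terms = sorted(["allergen", "allergy", "asthma", "drug",
                "foodallergy", "pollen", "zinc"])

prefix_matches, prefix_examined = match_trailing_wildcard(terms, "allergy")
infix_matches, infix_examined = match_leading_wildcard(terms, "allergy")
```

In this toy model the trailing-wildcard lookup stops at the first term that no
longer shares the prefix, whereas the leading-wildcard lookup examines the
whole list. The gap grows with the number of unique terms in the index.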


Depending on your data and what Lucene has indexed, it may be sufficient to use
plain ‘Allergy’ as your search term.  I would try just that first and see what
performance you get, then add back only the trailing wildcard, i.e. ‘Allergy*’.
I suspect the first form (no wildcards) will be very fast but may not return
anything, depending on your data.  The trailing-wildcard form will be slower,
but may be sufficient if the non-wildcard form is not.
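
One way to run that comparison is a small Python script along these lines.
This is a sketch under assumptions: the prefixes and graph pattern are copied
unchanged from the question, the helper names build_query and time_query are
hypothetical, and the endpoint would be the one shown in the Fuseki log
(http://localhost:3030/tes/sparql).

```python
import time
import urllib.parse
import urllib.request

# Query shape copied from the question; only the search string varies.
QUERY_TEMPLATE = """\
PREFIX tes:  <http://mycompany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX text: <http://jena.apache.org/text#>

SELECT DISTINCT ?this ?json
WHERE {
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
  ?this fhir:CodeSystem.name/text:query (tes:indexedValue '@@SEARCH@@')
}
"""


def build_query(search):
    """Return the query text for one search string (hypothetical helper)."""
    if any(ch in search for ch in ("'", "\\", "\n")):
        raise ValueError("search string would break the quoted literal")
    return QUERY_TEMPLATE.replace("@@SEARCH@@", search)


def time_query(endpoint, query):
    """POST the query form-encoded (SPARQL protocol) and return the elapsed
    wall-clock seconds, including reading the whole response."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Accept": "application/sparql-results+json"},
    )
    started = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()
    return time.perf_counter() - started


# The three forms to compare, cheapest first.
VARIANTS = ["Allergy", "Allergy*", "*Allergy*"]
queries = {search: build_query(search) for search in VARIANTS}
```

Calling time_query(endpoint, queries[s]) for each s in VARIANTS, a few times
each so that caches are warm, should show how much of the time is due to each
wildcard.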

Which search string works for your use case will depend on your data.  Given
the nature of your data (you state that there are only 18k triples for which
you care about text indexing), it is entirely possible that a plain SPARQL 1.1
CONTAINS() filter expression is fast enough for these simple queries.
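
As a rough illustration of why a plain filter can be adequate at this scale,
the Python sketch below does a linear, case-sensitive substring scan over
18,230 strings, which is conceptually what CONTAINS() does to each candidate
value. The values are synthetic placeholders (an assumption, not real data)
and the function name is hypothetical.

```python
def contains_scan(values, needle):
    """Linear, case-sensitive substring test over every value, analogous to
    applying FILTER contains(?name, needle) to each binding."""
    return [value for value in values if needle in value]


# Synthetic stand-ins for the 18,230 indexed literals mentioned in the question.
values = ["Code system %d" % number for number in range(18230)]
values[42] = "Food Allergy Codes"
values[4242] = "AllergyIntolerance categories"
values[9000] = "allergy (lower case, so not matched)"

hits = contains_scan(values, "Allergy")
```

Note that the scan, like CONTAINS(), is case-sensitive, whereas an index built
with StandardAnalyzer lower-cases its terms, so the two approaches will not
always return the same results.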

Rob

From: Goławski, Paweł <pawel.golaw...@cgm.com>
Date: Wednesday, 15 June 2022 at 14:38
To: users@jena.apache.org <users@jena.apache.org>
Subject: Jena Full Text Search poor performance
Hi,
I’m trying to use the Jena Full Text Search feature as described at
https://jena.apache.org/documentation/query/text-query.html
I’ve noticed that queries using “text:query” are very slow: ~20 times slower
than similar queries using a “FILTER contains” clause.
There are ~5.5M triples in the database, 18230 of them with the indexed
predicate.
The database takes 1.3GB and the index 4.2M of disk space.
The memory available to the Fuseki server is 16GB.

My config is quite simple; nothing special is configured:

################################################################################################

PREFIX :        <#>
PREFIX fuseki:  <http://jena.apache.org/fuseki#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ja:      <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX tdb:     <http://jena.hpl.hp.com/2008/tdb#>
PREFIX tdb2:    <http://jena.apache.org/2016/tdb#>
PREFIX text:    <http://jena.apache.org/text#>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX fhir:    <http://hl7.org/fhir/>
PREFIX tes:     <http://mycompany/tes/>

[] rdf:type fuseki:Server ;
   fuseki:services (
                       :service
                   ) .

:service rdf:type fuseki:Service ;
                     fuseki:name "tes" ;
                     fuseki:serviceQuery               "query" , "sparql" ;    # SPARQL query service
                     fuseki:serviceUpdate              "update" ;   # SPARQL update service
                     fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
                     fuseki:serviceReadGraphStore      "get" ;
                     fuseki:serviceUpload              "upload" ;
                     fuseki:dataset :text_dataset ;
.

# A TextDataset is a regular dataset with a text index.
:text_dataset rdf:type    text:TextDataset ;
                          text:dataset   :tdb2_dataset_readwrite;
                          text:index     :indexLucene ;
.

# A TDB dataset used for RDF storage
:tdb2_dataset_readwrite rdf:type tdb2:DatasetTDB ;
    tdb2:location  "databases/db" ;
.


:indexLucene a text:TextIndexLucene ;
     text:directory "databases/db-index" ;
     text:entityMap :entMap ;
     text:storeValues true ;
     text:analyzer [
                       a text:StandardAnalyzer ;
#                       text:stopWords ("the" "a" "an" "and" "but")
                   ] ;
#    text:queryAnalyzer [ a text:StandardAnalyzer ] ;
     text:queryParser text:QueryParser ;
# text:multilingualSupport true ; # optional
.
# Entity map (see documentation for other options)
:entMap a text:EntityMap ;
            text:defaultField     "tesValue" ;
            text:entityField      "uri" ;
            text:uidField         "uid" ;
            text:langField        "lang" ;
            text:graphField       "graph" ;
            text:map (
                         [ text:field "tesValue" ;
                           text:predicate tes:indexedValue
                         ]
                     )
.

################################################################################################



Here are two very similar SPARQL queries:

1.  with “text:query” clause:



PREFIX  tes:  <http://mycompany/tes/>
PREFIX  fhir: <http://hl7.org/fhir/>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  owl:  <http://www.w3.org/2002/07/owl#>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX  skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX  text: <http://jena.apache.org/text#>

SELECT DISTINCT  ?this ?json
WHERE
  { ?this  rdf:type  fhir:CodeSystem .
    ?this fhir:Resource.jsonContent/fhir:value ?json .
    ?this fhir:CodeSystem.name/text:query (tes:indexedValue '*Allergy*')
  }



2.  and with “FILTER contains” clause:



PREFIX  tes:  <http://cgm.com/tes/>
PREFIX  fhir: <http://hl7.org/fhir/>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  owl:  <http://www.w3.org/2002/07/owl#>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX  skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX  text: <http://jena.apache.org/text#>

SELECT DISTINCT  ?this ?json
WHERE
  { ?this  rdf:type  fhir:CodeSystem .
    ?this fhir:Resource.jsonContent/fhir:value ?json .
    ?this fhir:CodeSystem.name/tes:indexedValue ?name FILTER contains(?name, "Allergy")
  }
==========================================================================================

Log from fuseki:



15:19:33 INFO  Fuseki          :: [4] POST http://localhost:3030/tes/sparql
15:19:33 INFO  Fuseki          :: [4] Query = PREFIX  tes:
<http://mycomany/tes/> PREFIX  fhir: <http://hl7.org/fhir/> PREFIX  rdf:
<http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX  owl:
<http://www.w3.org/2002/07/owl#> PREFIX  xsd:
<http://www.w3.org/2001/XMLSchema#> PREFIX  skos:
<http://www.w3.org/2004/02/skos/core#> PREFIX  rdfs:
<http://www.w3.org/2000/01/rdf-schema#> PREFIX  text:
<http://jena.apache.org/text#>  SELECT DISTINCT  ?this ?json WHERE   { ?this
rdf:type  fhir:CodeSystem .     ?this fhir:Resource.jsonContent/fhir:value
?json .      ?this fhir:CodeSystem.name/tes:indexedValue ?name FILTER
contains(?name, "Allergy")   }
15:19:33 INFO  Fuseki          :: [4] 200 OK (55 ms)



15:20:25 INFO  Fuseki          :: [5] POST http://localhost:3030/tes/sparql
15:20:25 INFO  Fuseki          :: [5] Query = PREFIX  tes:
<http://mycomany/tes/> PREFIX  fhir: <http://hl7.org/fhir/> PREFIX  rdf:
<http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX  owl:
<http://www.w3.org/2002/07/owl#> PREFIX  xsd:
<http://www.w3.org/2001/XMLSchema#> PREFIX  skos:
<http://www.w3.org/2004/02/skos/core#> PREFIX  rdfs:
<http://www.w3.org/2000/01/rdf-schema#> PREFIX  text:
<http://jena.apache.org/text#>  SELECT DISTINCT  ?this ?json WHERE   { ?this
rdf:type  fhir:CodeSystem .     ?this fhir:Resource.jsonContent/fhir:value
?json .      ?this fhir:CodeSystem.name/text:query (tes:indexedValue
'*Allergy*')   }
15:20:36 INFO  Fuseki          :: [5] 200 OK (10,888 s)
==========================================================================================

There is no difference between the standard and Docker installations.
I even found bug https://issues.apache.org/jira/browse/JENA-999 regarding
performance, but it was already fixed in version 3.1.0, while I’m currently
using version 4.4.0.
Has anyone else noticed the same problem?
Or am I doing something wrong?
Or do I need some additional configuration?
Is there any solution to this problem?
