I think your problem might be your text:query string here. You have ‘*Allergy*’ as your search string, which has both leading and trailing wildcards. The leading wildcard is extremely expensive for Lucene to evaluate: it must check every unique term in your index against the search term to figure out which terms match, and thus which documents to return. A trailing wildcard is less bad but may still require Lucene to consider a large swathe of possible terms.
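To make the cost difference concrete, here is a toy sketch in Python. It is not how Lucene is actually implemented; it only assumes that index terms are kept in sorted order, and the function names and the sample term list are made up for illustration. A trailing wildcard can jump to the first candidate by binary search and stop as soon as the prefix no longer matches, while a leading wildcard gives no starting point, so every term has to be tested.

```python
from bisect import bisect_left


def prefix_lookup(sorted_terms, prefix):
    """Model of 'prefix*': seek to the first candidate, then walk forward."""
    start = bisect_left(sorted_terms, prefix)
    matches, examined = [], 0
    for term in sorted_terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches, examined


def infix_lookup(sorted_terms, fragment):
    """Model of '*fragment*': no usable start point, so test every term."""
    matches = [term for term in sorted_terms if fragment in term]
    return matches, len(sorted_terms)


# Hypothetical, already-lowercased index terms (illustrative only).
terms = sorted([
    "allergen", "allergies", "allergy", "asthma",
    "drug", "foodallergy", "penicillin", "zinc",
])

print(prefix_lookup(terms, "allergy"))
print(infix_lookup(terms, "allergy"))
```

In this toy model the prefix lookup inspects only a couple of terms, whereas the infix lookup inspects all of them, and that gap grows with the number of unique terms in the index.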
Depending on the contents of your data, and what Lucene has indexed, it may be sufficient to simply use ‘Allergy’ as your search term. I would try using just that and see what performance you get. Then try adding back just the trailing wildcard, i.e. ‘Allergy*’. I suspect the first (no wildcards) will be very fast BUT may not return anything, depending on your data. The trailing-wildcard-only form will be slower, but may be sufficient if the non-wildcard form is not.

In all cases, which search string works for your use case is going to depend on your data. And it is entirely possible, given the nature of your data (you state there are only 18k triples for which you care about text indexing), that just using a plain SPARQL 1.1 CONTAINS() filter expression is sufficiently fast for these simple queries.

Rob

From: Goławski, Paweł <pawel.golaw...@cgm.com>
Date: Wednesday, 15 June 2022 at 14:38
To: users@jena.apache.org <users@jena.apache.org>
Subject: Jena Full Text Search poor performance

Hi,

I’m trying to use the Jena Full Text Search feature according to https://jena.apache.org/documentation/query/text-query.html

I’ve noticed that queries using “text:query” are very slow: ~20 times slower than similar queries using a “FILTER contains” clause.

There are ~5.5M triples in the database, 18230 triples with the indexed predicate. The database takes 1.3GB and the index 4.2M of disk space. Available memory for the Fuseki server is 16GB.
My config is quite easy, there is nothing special configured:

################################################################################################
PREFIX :        <#>
PREFIX fuseki:  <http://jena.apache.org/fuseki#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ja:      <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX tdb:     <http://jena.hpl.hp.com/2008/tdb#>
PREFIX tdb2:    <http://jena.apache.org/2016/tdb#>
PREFIX text:    <http://jena.apache.org/text#>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX fhir:    <http://hl7.org/fhir/>
PREFIX tes:     <http://mycompany/tes/>

[] rdf:type fuseki:Server ;
    fuseki:services (
        :service
    ) .

:service rdf:type fuseki:Service ;
    fuseki:name                       "tes" ;
    fuseki:serviceQuery               "query" , "sparql" ;  # SPARQL query service
    fuseki:serviceUpdate              "update" ;            # SPARQL update service
    fuseki:serviceReadWriteGraphStore "data" ;              # SPARQL Graph store protocol (read and write)
    fuseki:serviceReadGraphStore      "get" ;
    fuseki:serviceUpload              "upload" ;
    fuseki:dataset                    :text_dataset ;
    .

# A TextDataset is a regular dataset with a text index.
:text_dataset rdf:type text:TextDataset ;
    text:dataset :tdb2_dataset_readwrite ;
    text:index   :indexLucene ;
    .

# A TDB dataset used for RDF storage
:tdb2_dataset_readwrite rdf:type tdb2:DatasetTDB ;
    tdb2:location "databases/db" ;
    .

:indexLucene a text:TextIndexLucene ;
    text:directory   "databases/db-index" ;
    text:entityMap   :entMap ;
    text:storeValues true ;
    text:analyzer [
        a text:StandardAnalyzer ;
        # text:stopWords ("the" "a" "an" "and" "but")
    ] ;
    # text:queryAnalyzer [ a text:StandardAnalyzer ] ;
    text:queryParser text:QueryParser ;
    # text:multilingualSupport true ; # optional
    .

# Entity map (see documentation for other options)
:entMap a text:EntityMap ;
    text:defaultField "tesValue" ;
    text:entityField  "uri" ;
    text:uidField     "uid" ;
    text:langField    "lang" ;
    text:graphField   "graph" ;
    text:map (
        [ text:field "tesValue" ; text:predicate tes:indexedValue ]
    ) .
################################################################################################

There are very similar SPARQL queries:

1. with “text:query” clause:

PREFIX tes:  <http://mycompany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>
SELECT DISTINCT ?this ?json
WHERE {
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
  ?this fhir:CodeSystem.name/text:query (tes:indexedValue '*Allergy*')
}

2. and with “FILTER contains” clause:

PREFIX tes:  <http://cgm.com/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>
SELECT DISTINCT ?this ?json
WHERE {
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
  ?this fhir:CodeSystem.name/tes:indexedValue ?name
  FILTER contains(?name, "Allergy")
}

==========================================================================================
Log from fuseki:

15:19:33 INFO Fuseki :: [4] POST http://localhost:3030/tes/sparql
15:19:33 INFO Fuseki :: [4] Query = PREFIX tes: <http://mycomany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>
SELECT DISTINCT ?this ?json
WHERE {
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
  ?this fhir:CodeSystem.name/tes:indexedValue ?name
  FILTER contains(?name, "Allergy")
}
15:19:33 INFO Fuseki :: [4] 200 OK (55 ms)
15:20:25 INFO Fuseki :: [5] POST http://localhost:3030/tes/sparql
15:20:25 INFO Fuseki :: [5] Query = PREFIX tes: <http://mycomany/tes/>
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>
SELECT DISTINCT ?this ?json
WHERE {
  ?this rdf:type fhir:CodeSystem .
  ?this fhir:Resource.jsonContent/fhir:value ?json .
  ?this fhir:CodeSystem.name/text:query (tes:indexedValue '*Allergy*')
}
15:20:36 INFO Fuseki :: [5] 200 OK (10,888 s)
==========================================================================================

There is no difference between standard and docker installations.

I even found bug https://issues.apache.org/jira/browse/JENA-999 regarding performance, which is already fixed in version 3.1.0, while I’m currently using version 4.4.0.

Did anyone notice the same problem? Or maybe I’m doing something wrong? Or I must do some additional magic configuration? Is there any solution for this problem?