[ https://issues.apache.org/jira/browse/JENA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephen Allen updated JENA-999:
-------------------------------
    Attachment: jena-text benchmarks.png

I've attached a chart (note the log-log scale) of some performance benchmarks. This is an in-memory TDB database with an in-memory Lucene index. There are 10,000 entities ("test0" to "test9999") in the system. There are two statements per entity:
{code}
<http://example.org/test1234> rdf:type <http://example.org/Entity> .
<http://example.org/test1234> rdfs:label "test1234" .
{code}

Jena-text indexes the rdfs:label predicate. The benchmarks show 5 different implementations for the following query:
{code}
select *
where {
  ?s rdf:type <http://example.org/Entity> .
  # lucene query string below varied to adjust the number of results
  ?s text:query ( rdfs:label "test1*" )
}
{code}

The 5 implementations are as follows:
# *Original* - the current jena-text implementation with a concrete subject
# *Lucene Index Join* - Option 1 described above. Added the subject as a parameter to the Lucene query
# *Hash Join* - Option 2 described above. Perform the Lucene query for all results, which are then cached in the ExecutionContext in a hashmap
# *Non-Indexed* - The query above is modified to remove the {{text:query}} statement and instead simply retrieve the label from the RDF store and use {{strstarts()}} to filter the results
# *Text Search First* - The current jena-text implementation, but the query above is modified by switching the order of the two BGPs. This is the unbound subject case.

From these initial results, it seems worthwhile to try to implement a version of option 2 (the hash join). This seems quite doable without the ugly hack of storing stuff in the ExecutionContext by doing as you suggest and implementing {{PropertyFunction}} ourselves instead of extending {{PropertyFunctionBase}}.

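To make the hash join idea (option 2) concrete, here is a minimal sketch in plain Java. It is an illustration only, not the jena-text code: the class name {{TextHashJoinSketch}}, the {{indexCalls}} counter, and the {{Function}} standing in for the Lucene search are all hypothetical, and a plain {{HashMap}} stands in for wherever the results would really be stashed. The point it shows is that the text index is queried once per query string, and each already-bound subject is then resolved with a hash lookup.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch: join already-bound subjects against text index hits via a hash lookup. */
final class TextHashJoinSketch {
    // Hypothetical per-execution cache: query string -> (subject URI -> score).
    private final Map<String, Map<String, Float>> cache = new HashMap<>();
    // Stand-in for the Lucene search (assumption): returns all hits for a query string.
    private final Function<String, Map<String, Float>> textIndex;
    // Counts how often the stand-in index is actually queried.
    int indexCalls = 0;

    TextHashJoinSketch(Function<String, Map<String, Float>> textIndex) {
        this.textIndex = textIndex;
    }

    /** Returns the score if the bound subject matches the query, else null. */
    Float probe(String queryString, String subjectUri) {
        Map<String, Float> hits = cache.get(queryString);
        if (hits == null) {
            // First time this query string is seen: run the general query once.
            indexCalls++;
            hits = new HashMap<>(textIndex.apply(queryString));
            cache.put(queryString, hits);
        }
        return hits.get(subjectUri);
    }

    /** Keeps only the bound subjects that appear in the cached hits, in input order. */
    Map<String, Float> join(String queryString, List<String> boundSubjects) {
        Map<String, Float> matched = new LinkedHashMap<>();
        for (String subject : boundSubjects) {
            Float score = probe(queryString, subject);
            if (score != null) {
                matched.put(subject, score);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Fake index: pretends that "test1*" matches two entities.
        Map<String, Float> fakeHits = new HashMap<>();
        fakeHits.put("http://example.org/test1", 1.0f);
        fakeHits.put("http://example.org/test10", 0.5f);
        TextHashJoinSketch sketch = new TextHashJoinSketch(q -> fakeHits);

        List<String> bound = List.of(
                "http://example.org/test1",
                "http://example.org/test2",
                "http://example.org/test10");
        System.out.println(sketch.join("test1*", bound));
        System.out.println("index calls: " + sketch.indexCalls);
    }
}
```

In this sketch the cache lives as long as the object does, so creating one instance per query execution sidesteps stale results between update operations. A real implementation would still have to decide where that lifetime boundary sits.
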
> Poor jena-text query performance when a bound subject is used
> -------------------------------------------------------------
>
>                 Key: JENA-999
>                 URL: https://issues.apache.org/jira/browse/JENA-999
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Stephen Allen
>            Assignee: Stephen Allen
>            Priority: Minor
>         Attachments: jena-text benchmarks.png
>
>
> When executing a jena-text query, the performance is terrible if the subject
> is already bound to a variable. This is because, on every iteration, the
> current code executes a new Lucene query that does not have the
> subject/entity bound and then iterates through the Lucene results to join
> against the subject. This is quite inefficient.
> Example query:
> {code}
> select *
> where {
>   ?s rdf:type <http://example.org/Entity> .
>   ?s text:query ( rdfs:label "test" ) .
> }
> {code}
> This would be quite slow if there were a lot of entities in the system.
> Two potential solutions present themselves:
> # Craft a more explicit Lucene query that specifies the entity URI, so that
> the results coming back from Lucene are much smaller. However, this would
> cause problems with the score not being correct across multiple iterations.
> Additionally, we are still potentially running a lot of Lucene queries, each
> of which probably has a non-negligible constant cost (parsing the query
> string, etc).
> # Execute the more general Lucene query the first time it is encountered,
> then cache the results somewhere. From there, we can then perform a hash
> table lookup against those cached results.
> I would like to pursue option 2, but there is a problem. Because jena-text
> is implemented as a property function instead of a query op in and of itself
> (like QueryIterMinus is, for example), we have to find a place to stash the
> Lucene results. I believe this can be done by placing it in the
> ExecutionContext object, using the Lucene query as a cache key.
> Updates provide a slightly troubling case because you could have an update
> request like:
> {code}
> insert data { <urn:test1> rdf:type <http://example.org/Entity> ; rdfs:label "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label "test" ) ; ?p ?o . } ;
> insert data { <urn:test2> rdf:type <http://example.org/Entity> ; rdfs:label "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label "test" ) ; ?p ?o . }
> {code}
> And then the end result should be an empty database. But if the
> ExecutionContext was the same for both delete queries, you would be using the
> cached results from the first delete query in the second delete query, which
> would result in {{<urn:test2>}} not being deleted properly.
> If the ExecutionContext is indeed shared between the two update queries in
> the situation above, I think this can be solved by making the cache key for
> the Lucene resultset a combination of both the Lucene query and the
> QueryIterRoot or BindingRoot. I need to investigate this. Alternatively,
> if there were a way to be notified when a query has finished executing, we
> could clear the cache in the ExecutionContext.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)