[ https://issues.apache.org/jira/browse/JENA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091544#comment-15091544 ]
ASF GitHub Bot commented on JENA-999:
-------------------------------------

Github user osma commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-170452765

    I decided to merge as-is (after rebase), i.e. using `getOrFill` and the
    lambda function. It's not too unclear either way. I'm just not yet used
    to this style of Java programming; it feels more natural in other
    languages.

> Poor jena-text query performance when a bound subject is used
> -------------------------------------------------------------
>
>                 Key: JENA-999
>                 URL: https://issues.apache.org/jira/browse/JENA-999
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Stephen Allen
>            Assignee: Stephen Allen
>            Priority: Minor
>         Attachments: PerformanceTester.java, jena-text benchmarks.png
>
>
> When executing a jena-text query, the performance is terrible if the subject
> is already bound to a variable. This is because, on every iteration, the
> current code executes a new Lucene query that does not have the
> subject/entity bound, and then iterates through the Lucene results to join
> against the subject. This is quite inefficient.
> Example query:
> {code}
> select *
> where {
>   ?s rdf:type <http://example.org/Entity> .
>   ?s text:query ( rdfs:label "test" ) .
> }
> {code}
> This would be quite slow if there were a lot of entities in the system.
> Two potential solutions present themselves:
> # Craft a more explicit Lucene query that specifies the entity URI, so that
> the results coming back from Lucene are much smaller. However, this would
> cause problems with the score not being correct across multiple iterations.
> Additionally, we are still potentially running a lot of Lucene queries, each
> of which probably has a non-negligible constant cost (parsing the query
> string, etc.).
> # Execute the more general Lucene query the first time it is encountered,
> then cache the results somewhere. From there, we can perform a hash
> table lookup against those cached results.
> I would like to pursue option 2, but there is a problem. Because jena-text
> is implemented as a property function instead of a query op in and of itself
> (like QueryIterMinus, for example), we have to find a place to stash the
> Lucene results. I believe this can be done by placing them in the
> ExecutionContext object, using the Lucene query as a cache key. Updates
> present a slightly troubling case, because you could have an update request
> like:
> {code}
> insert data { <urn:test1> rdf:type <http://example.org/Entity> ; rdfs:label
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label
> "test" ) ; ?p ?o . } ;
> insert data { <urn:test2> rdf:type <http://example.org/Entity> ; rdfs:label
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label
> "test" ) ; ?p ?o . }
> {code}
> The end result should be an empty database. But if the ExecutionContext were
> the same for both delete queries, the second delete query would use the
> cached results from the first, and {{<urn:test2>}} would not be deleted
> properly.
> If the ExecutionContext is indeed shared between the two update queries in
> the situation above, I think this can be solved by making the cache key for
> the Lucene result set a combination of both the Lucene query and the
> QueryIterRoot or BindingRoot. I need to investigate this. Alternatively,
> if there were a way to be notified when a query has finished executing, we
> could clear the cache in the ExecutionContext.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
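
For readers following option 2 in the quoted description, here is a minimal sketch of the idea: run the general query once, keep the hits in a hash table, and key the cache on both the query string and a token for the enclosing execution. It is not the code merged in the pull request. `TextHitCache`, `Key`, `hitsFor`, `scoreOf` and the `search` callback are hypothetical names, the `executionRoot` object only stands in for something like the QueryIterRoot or BindingRoot mentioned above, and `Map.computeIfAbsent` is used in place of the `getOrFill` helper, whose signature is not shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a per-execution cache of text-search hits. */
final class TextHitCache {

    /** Cache key: the query string plus an identity token for the execution. */
    static final class Key {
        final String luceneQuery;
        final Object executionRoot; // assumption: stands in for the root binding/iterator

        Key(String luceneQuery, Object executionRoot) {
            this.luceneQuery = luceneQuery;
            this.executionRoot = executionRoot;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            // Roots are compared by identity: a new execution gets a fresh cache entry.
            return luceneQuery.equals(other.luceneQuery)
                    && executionRoot == other.executionRoot;
        }

        @Override
        public int hashCode() {
            return 31 * luceneQuery.hashCode() + System.identityHashCode(executionRoot);
        }
    }

    private final Map<Key, Map<String, Float>> cache = new HashMap<>();
    private int queriesRun = 0;

    /** Returns subject URI -> score, calling search only the first time a key is seen. */
    Map<String, Float> hitsFor(String luceneQuery, Object executionRoot,
                               Function<String, Map<String, Float>> search) {
        return cache.computeIfAbsent(new Key(luceneQuery, executionRoot), k -> {
            queriesRun++;
            return search.apply(k.luceneQuery);
        });
    }

    /** Hash lookup for an already-bound subject; null means the subject did not match. */
    Float scoreOf(String subjectUri, String luceneQuery, Object executionRoot,
                  Function<String, Map<String, Float>> search) {
        return hitsFor(luceneQuery, executionRoot, search).get(subjectUri);
    }

    /** Number of times the underlying search was actually executed. */
    int queriesRun() {
        return queriesRun;
    }
}
```

With a key like this, the two delete operations in the update example would only share cached hits if they also shared the same root object, which is the open question raised in the description.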