[jira] [Comment Edited] (JENA-999) Poor jena-text query performance when a bound subject is used

Osma Suominen (JIRA) Tue, 22 Dec 2015 11:25:04 -0800

    [ https://issues.apache.org/jira/browse/JENA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068587#comment-15068587 ]


Osma Suominen edited comment on JENA-999 at 12/22/15 7:24 PM:
--------------------------------------------------------------

I've tried to implement a very simple cache (based on Andy's ideas in the 
previous comment) around the call to Lucene. See here:
https://github.com/osma/jena/commit/2044af9317422654e290e6b7c6585932346a6dff
(Note that this sits on top of my recent additional fix for JENA-1093, which 
hasn't yet been merged to Jena master; see PR #112.)

This simple cache seems to accomplish most of what Stephen Allen's first commit 
did, but is much less invasive. It dramatically reduces query time in simple 
cases such as the benchmark query above (e.g. from 70 seconds to 700 ms for 
the query that returns 1111 results). But with a more complex query involving 
UNION (again from Stephen's example in the attached PerformanceTester.java), it 
doesn't help, because the TextQueryPF seems to be recreated on every iteration, 
i.e. 10000 times with the benchmark data set, so a cache held by it never gets 
reused. The cache would have to be stored elsewhere (in the Context?) so that 
it survives across iterations. But that also requires some cache management so 
that memory is not leaked over time.
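
To make the "cache management" point concrete, here is a minimal sketch of a 
size-bounded, least-recently-used cache keyed on the Lucene query string. It is 
not the code in the commit linked above; the class name BoundedQueryCache, the 
loader callback and the string keys are all made up for illustration, and 
where such an object would actually live (e.g. the Context) is left open.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bounded cache of Lucene results, keyed on the query string. */
final class BoundedQueryCache<V> {
    private final Map<String, V> entries;

    BoundedQueryCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first,
        // and removeEldestEntry evicts it once the cache is over capacity.
        this.entries = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, calling the (expensive) loader only on a miss. */
    synchronized V get(String queryString, Function<String, V> loader) {
        V cached = entries.get(queryString);
        if (cached == null) {
            cached = loader.apply(queryString);
            entries.put(queryString, cached);
        }
        return cached;
    }

    synchronized boolean contains(String queryString) {
        return entries.containsKey(queryString);
    }

    synchronized int entryCount() {
        return entries.size();
    }

    public static void main(String[] args) {
        BoundedQueryCache<List<String>> cache = new BoundedQueryCache<>(2);
        // Stand-in for the real Lucene call (an assumption, not jena-text API).
        Function<String, List<String>> fakeLucene = q -> List.of("hit for " + q);
        cache.get("label:test", fakeLucene);
        cache.get("label:test", fakeLucene); // served from the cache
        System.out.println(cache.entryCount());
    }
}
```

Bounding the number of entries is only one possible policy; clearing the cache 
when a query finishes would be another.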

Another obvious problem, not addressed here, is that in the bound-subject case 
the algorithm that selects the correct bindings to return (the ones with the 
right subject) has to traverse the whole list of Lucene results each time. A 
smarter cache implemented using a Multimap keyed on subject URI (similar to 
Stephen's original cache, but allowing multiple values per key) would be able 
to do much better. But you have to start somewhere...
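
As a rough sketch of the Multimap idea: index the Lucene results once by 
subject URI, so that each bound subject becomes a single hash lookup instead of 
a scan over all results. A plain HashMap of lists stands in for a Guava 
Multimap here, and the Hit class and all names are hypothetical placeholders, 
not the types jena-text actually uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index of Lucene results, grouped by subject URI. */
final class SubjectIndexedHits {
    /** Stand-in for one Lucene result: a subject URI plus its score. */
    static final class Hit {
        final String subjectUri;
        final float score;

        Hit(String subjectUri, float score) {
            this.subjectUri = subjectUri;
            this.score = score;
        }
    }

    private final Map<String, List<Hit>> bySubject = new HashMap<>();

    /** Builds the index with one pass over the full result list. */
    SubjectIndexedHits(List<Hit> luceneResults) {
        for (Hit hit : luceneResults) {
            bySubject.computeIfAbsent(hit.subjectUri, k -> new ArrayList<>()).add(hit);
        }
    }

    /** All hits for one subject; an empty list if the subject did not match. */
    List<Hit> hitsFor(String subjectUri) {
        return bySubject.getOrDefault(subjectUri, Collections.emptyList());
    }

    public static void main(String[] args) {
        List<Hit> results = new ArrayList<>();
        results.add(new Hit("urn:test1", 1.5f));
        results.add(new Hit("urn:test2", 0.7f));
        results.add(new Hit("urn:test1", 0.2f)); // same subject, second literal
        SubjectIndexedHits index = new SubjectIndexedHits(results);
        System.out.println(index.hitsFor("urn:test1").size()); // prints 2
    }
}
```

Keeping a list per key matters because one subject can match through several 
literals, each with its own score.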







> Poor jena-text query performance when a bound subject is used
> -------------------------------------------------------------
>
>                 Key: JENA-999
>                 URL: https://issues.apache.org/jira/browse/JENA-999
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Stephen Allen
>            Assignee: Stephen Allen
>            Priority: Minor
>         Attachments: PerformanceTester.java, jena-text benchmarks.png
>
>
> When executing a jena-text query, the performance is terrible if the subject 
> variable is already bound.  This is because, on every iteration, the current 
> code executes a new Lucene query that does not constrain the subject/entity, 
> and then iterates through the Lucene results to join against the subject.  
> This is quite inefficient.
> Example query:
> {code}
> select *
> where {
>   ?s rdf:type <http://example.org/Entity> .
>   ?s text:query ( rdfs:label "test" ) .
> }
> {code}
> This would be quite slow if there were a lot of entities in the system.
> Two potential solutions present themselves:
> # Craft a more explicit Lucene query that specifies the entity URI, so that 
> the result sets coming back from Lucene are much smaller.  However, this 
> would cause problems with the score not being correct across multiple 
> iterations.  Additionally, we are still potentially running a lot of Lucene 
> queries, each of which probably has a non-negligible constant cost (parsing 
> the query string, etc).
> # Execute the more general Lucene query the first time it is encountered, 
> then cache the results somewhere.  From there, we can perform a hash table 
> lookup against those cached results.
> I would like to pursue option 2, but there is a problem.  Because jena-text 
> is implemented as a property function instead of a query op in and of itself 
> (like QueryIterMinus is for example), we have to find a place to stash the 
> lucene results.  I believe this can be done by placing it in the 
> ExecutionContext object, using the lucene query as a cache key.  Updates 
> provide a slightly troubling case because you could have an update request 
> like:
> {code}
> insert data { <urn:test1> rdf:type <http://example.org/Entity> ; rdfs:label 
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label 
> "test" ) ; ?p ?o . } ;
> insert data { <urn:test2> rdf:type <http://example.org/Entity> ; rdfs:label 
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label 
> "test" ) ; ?p ?o . }
> {code}
> The end result should be an empty database.  But if the ExecutionContext 
> were the same for both delete queries, the second delete query would use the 
> cached results from the first, so {{<urn:test2>}} would not be deleted.
> If the ExecutionContext is indeed shared between the two update queries in 
> the situation above, I think this can be solved by making the cache key for 
> the Lucene result set a combination of both the Lucene query and the 
> QueryIterRoot or BindingRoot.  I need to investigate this.  Alternatively, 
> if there were a way to be notified when a query has finished executing, we 
> could clear the cache in the ExecutionContext at that point.
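
The combined cache key proposed in the description could look roughly like the 
sketch below. It assumes that each query execution has its own root object 
(standing in for QueryIterRoot or BindingRoot), which is exactly the point the 
description says still needs investigating. ScopedCacheKey and its fields are 
invented names, and the root is modelled as a plain Object compared by 
identity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache key: the Lucene query plus the execution it belongs to. */
final class ScopedCacheKey {
    private final String luceneQuery;
    private final Object executionRoot; // compared by identity, not equals()

    ScopedCacheKey(String luceneQuery, Object executionRoot) {
        this.luceneQuery = luceneQuery;
        this.executionRoot = executionRoot;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ScopedCacheKey)) return false;
        ScopedCacheKey that = (ScopedCacheKey) other;
        return luceneQuery.equals(that.luceneQuery)
                && executionRoot == that.executionRoot;
    }

    @Override
    public int hashCode() {
        return 31 * luceneQuery.hashCode() + System.identityHashCode(executionRoot);
    }

    public static void main(String[] args) {
        Object firstDelete = new Object();  // stand-in for the first query's root
        Object secondDelete = new Object(); // stand-in for the second query's root
        Map<ScopedCacheKey, String> cache = new HashMap<>();
        cache.put(new ScopedCacheKey("label:test", firstDelete), "results of first delete");
        // Same Lucene query, different execution: no stale hit.
        System.out.println(cache.containsKey(new ScopedCacheKey("label:test", secondDelete))); // prints false
        System.out.println(cache.containsKey(new ScopedCacheKey("label:test", firstDelete)));  // prints true
    }
}
```

With a key like this, the second delete in the update request above would miss 
the cache and re-run the Lucene query, at the price of keeping entries for 
finished executions around until something evicts them.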



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
