[jira] [Commented] (JENA-999) Poor jena-text query performance when a bound subject is used

ASF GitHub Bot (JIRA) Tue, 05 Jan 2016 12:42:04 -0800

    [ 
https://issues.apache.org/jira/browse/JENA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083737#comment-15083737
 ]


ASF GitHub Bot commented on JENA-999:
-------------------------------------

GitHub user osma opened a pull request:

    https://github.com/apache/jena/pull/119

    JENA-999: jena-text Lucene cache using multimaps

    This set of commits implements a caching layer for Lucene queries. The 
cache is stored in the Context so that it persists even when new 
TextQueryPF instances are created. Cache entries for query results are Guava Multimaps, 
which allow efficient lookups of known subject URIs in the case where the 
subject is already bound.
    
    @afs I hope I did the Context storage right. You said it will have the 
right lifetime and I hope that's true since otherwise memory leaks may occur. I 
looked at Stephen Allen's example from the jena-text-cache experimental branch: 
https://github.com/apache/jena/commit/45081fabe012c56b3fc7ae6a92b4518245779eb2
    
    I have verified that this gives good performance with Stephen's example 
queries, even in the UNION case where TextQueryPF is recreated over and over. 
For example, a query with 11,111 results is answered in less than 300 ms.
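
The per-subject lookup described above can be illustrated with a minimal Java sketch. This is not the code from the pull request: the names TextHit, QueryResultCache, hitsFor and the runQuery callback are hypothetical, and a nested java.util map stands in for a Guava Multimap so that the example has no external dependencies. It only assumes that the index is asked once per query string and that later calls with a bound subject become hash lookups.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical result row: one text-index hit for a subject URI.
final class TextHit {
    final String subjectUri;
    final float score;

    TextHit(String subjectUri, float score) {
        this.subjectUri = subjectUri;
        this.score = score;
    }
}

// Hypothetical cache: query string -> (subject URI -> hits).
// The inner map plays the role that a Guava Multimap plays in the pull request.
final class QueryResultCache {
    private final Map<String, Map<String, List<TextHit>>> cache = new HashMap<>();
    private int indexCalls = 0;

    // Returns the hits for one bound subject, running the index query
    // only the first time a given query string is seen.
    List<TextHit> hitsFor(String queryString, String subjectUri,
                          Function<String, List<TextHit>> runQuery) {
        Map<String, List<TextHit>> bySubject = cache.get(queryString);
        if (bySubject == null) {
            indexCalls++;
            bySubject = new HashMap<>();
            for (TextHit hit : runQuery.apply(queryString)) {
                bySubject.computeIfAbsent(hit.subjectUri, k -> new ArrayList<>()).add(hit);
            }
            cache.put(queryString, bySubject);
        }
        List<TextHit> hits = bySubject.get(subjectUri);
        return hits == null ? Collections.<TextHit>emptyList() : hits;
    }

    // Number of times the underlying index was actually queried.
    int indexCalls() {
        return indexCalls;
    }
}
```

With a structure like this, repeated calls for the same query string and different bound subjects cost one index query in total, followed by one hash lookup per subject.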

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/osma/jena jena-text-lucene-cache

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/119.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #119
    
----
commit a7bb1094a1750492c290d03ad3957d8fe42d4e2c
Author: Osma Suominen <osma.suomi...@aalto.fi>
Date:   2015-12-22T16:45:50Z

    very simple caching of Lucene query results in a hash map

commit af302e2b5cfa3ff2db9e1901dc36df547b1c4bad
Author: Osma Suominen <osma.suomi...@aalto.fi>
Date:   2016-01-05T20:05:31Z

    move Lucene query cache to Context for some persistence

commit b54e38bc00cfa3ddbb3969c4d8fb1efe658af9ea
Author: Osma Suominen <osma.suomi...@aalto.fi>
Date:   2016-01-05T20:07:24Z

    remove unused import

commit 718d275a7c5f160a0050ba392fdc1affadea093a
Author: Osma Suominen <osma.suomi...@aalto.fi>
Date:   2016-01-05T20:34:02Z

    store Multimaps in the cache for more efficient retrieval of known subject 
URIs

----


> Poor jena-text query performance when a bound subject is used
> -------------------------------------------------------------
>
>                 Key: JENA-999
>                 URL: https://issues.apache.org/jira/browse/JENA-999
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Stephen Allen
>            Assignee: Stephen Allen
>            Priority: Minor
>         Attachments: PerformanceTester.java, jena-text benchmarks.png
>
>
> When executing a jena-text query, the performance is terrible if the subject 
> is already bound to a variable.  This is because the current code executes 
> a new lucene query, without the subject/entity bound, on every iteration 
> and then iterates through the lucene results to join against the subject.  
> This is quite inefficient.
> Example query:
> {code}
> select *
> where {
>   ?s rdf:type <http://example.org/Entity> .
>   ?s text:query ( rdfs:label "test" ) .
> }
> {code}
> This would be quite slow if there were a lot of entities in the system.
> Two potential solutions present themselves:
> # Craft a more explicit lucene query that specifies the entity URI, so that 
> the results coming back from lucene are much smaller.  However, this would 
> cause problems with the score not being correct across multiple iterations.  
> Additionally we are still potentially running a lot of lucene queries, each 
> of which has a probably non-negligible constant cost (parsing the query 
> string, etc).
> # Execute the more general lucene query the first time it is encountered, 
> then cache the results somewhere.  From there, we can then perform a hash 
> table lookup against those cached results.
> I would like to pursue option 2, but there is a problem.  Because jena-text 
> is implemented as a property function instead of a query op in and of itself 
> (like QueryIterMinus is for example), we have to find a place to stash the 
> lucene results.  I believe this can be done by placing it in the 
> ExecutionContext object, using the lucene query as a cache key.  Updates 
> provide a slightly troubling case because you could have an update request 
> like:
> {code}
> insert data { <urn:test1> rdf:type <http://example.org/Entity> ; rdfs:label 
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label 
> "test" ) ; ?p ?o . } ;
> insert data { <urn:test2> rdf:type <http://example.org/Entity> ; rdfs:label 
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label 
> "test" ) ; ?p ?o . }
> {code}
> And then the end result should be an empty database.  But if the 
> ExecutionContext was the same for both delete queries, you would be using the 
> cached results from the first delete query in the second delete query, which 
> would result in {{<urn:test2>}} not being deleted properly.
> If the ExecutionContext is indeed shared between the two update queries in 
> the situation above, I think this can be solved by making the cache key for 
> the lucene resultset be a combination of both the lucene query and the 
> QueryIterRoot or BindingRoot.  I need to investigate this.  Alternatively, 
> if there were a way to be notified when a query has finished executing, we 
> could clear the cache in the ExecutionContext.
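
The combined cache key suggested in the description could look roughly like the following Java sketch. It is only an illustration under stated assumptions: ScopedQueryKey and ScopedCacheDemo are hypothetical names, and a plain Object compared by identity stands in for the QueryIterRoot or BindingRoot of one execution. Whether those Jena objects are actually distinct for each update operation is exactly the open question raised above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache key: the query string plus a token for one execution.
final class ScopedQueryKey {
    private final String queryString;
    private final Object executionRoot; // stand-in for QueryIterRoot / BindingRoot

    ScopedQueryKey(String queryString, Object executionRoot) {
        this.queryString = queryString;
        this.executionRoot = executionRoot;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ScopedQueryKey)) {
            return false;
        }
        ScopedQueryKey other = (ScopedQueryKey) o;
        // Same query text AND the very same execution root object.
        return queryString.equals(other.queryString)
                && executionRoot == other.executionRoot;
    }

    @Override
    public int hashCode() {
        return 31 * queryString.hashCode() + System.identityHashCode(executionRoot);
    }
}

final class ScopedCacheDemo {
    // Stores a result under (query, rootA) and reports whether a lookup
    // with (query, rootB) would find it.
    static boolean sharesEntry(Object rootA, Object rootB) {
        Map<ScopedQueryKey, String> cache = new HashMap<>();
        cache.put(new ScopedQueryKey("label:test", rootA), "results of first execution");
        return cache.containsKey(new ScopedQueryKey("label:test", rootB));
    }
}
```

Under this scheme the second delete in the update example would miss the cache, provided it runs with a different root object, and so would re-run the index query and see the newly inserted data.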



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
