[jira] [Commented] (JENA-999) Poor jena-text query performance when a bound subject is used

ASF GitHub Bot (JIRA) Tue, 05 Jan 2016 14:01:04 -0800

    [ https://issues.apache.org/jira/browse/JENA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083918#comment-15083918 ]


ASF GitHub Bot commented on JENA-999:
-------------------------------------

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/119#discussion_r48902208
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextQueryPF.java ---
    @@ -268,7 +276,25 @@ private QueryIterator concreteSubject(Binding binding, Node s, Node score, Node
             Explain.explain(execCxt.getContext(), "Text query: "+queryString) ;
             if ( log.isDebugEnabled())
                 log.debug("Text query: {} ({})", queryString,limit) ;
    -        return textIndex.query(property, queryString, limit) ;
    +
    +        String cacheKey = limit + " " + property + " " + queryString ;
    +        Map<String,ListMultimap<String,TextHit>> queryCache = 
    +            (Map<String,ListMultimap<String,TextHit>>) execCxt.getContext().get(cacheSymbol);
    +        if (queryCache == null) { /* doesn't yet exist, need to create it */
    +            queryCache = new LinkedHashMap();
    +            execCxt.getContext().put(cacheSymbol, queryCache);
    +        }
    +
    +        ListMultimap<String,TextHit> results = queryCache.get(cacheKey) ;
    +        if (results == null) { /* cache miss */
    --- End diff --
    
    Because you want to cache the result if it isn't already cached, maybe 
`queryCache.computeIfAbsent` could be useful here (`queryCache` is already a 
`Map`, so no `asMap()` is needed)? Just a thought; maybe it isn't any clearer.
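
    A minimal sketch of that suggestion, using only `java.util`. Here 
`runQuery` is a hypothetical stand-in for the `textIndex.query(...)` call, and a 
plain `List<String>` stands in for the `ListMultimap<String,TextHit>` in the 
diff. The point is only that `Map.computeIfAbsent` folds the null check, the 
miss path, and the `put` into one call:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ComputeIfAbsentSketch {
    // Counts how often the (hypothetical) index is actually queried.
    public static int indexCalls = 0;

    // Hypothetical stand-in for textIndex.query(property, queryString, limit).
    public static List<String> runQuery(String cacheKey) {
        indexCalls++;
        List<String> hits = new ArrayList<>();
        hits.add("hit for [" + cacheKey + "]");
        return hits;
    }

    // Returns the cached result for cacheKey, querying the index only on a miss.
    public static List<String> cachedQuery(Map<String, List<String>> queryCache,
                                           String cacheKey) {
        return queryCache.computeIfAbsent(cacheKey, ComputeIfAbsentSketch::runQuery);
    }

    public static void main(String[] args) {
        Map<String, List<String>> queryCache = new LinkedHashMap<>();
        // Same key shape as in the diff: limit + " " + property + " " + queryString.
        String cacheKey = 10 + " " + "rdfs:label" + " " + "test";
        cachedQuery(queryCache, cacheKey); // miss: runs the query and stores the result
        cachedQuery(queryCache, cacheKey); // hit: returns the stored result
        System.out.println("index calls: " + indexCalls);
    }
}
```

    One caveat: `computeIfAbsent` does not store a mapping when the function 
returns null, so a query that can return null would be re-run on every call.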


> Poor jena-text query performance when a bound subject is used
> -------------------------------------------------------------
>
>                 Key: JENA-999
>                 URL: https://issues.apache.org/jira/browse/JENA-999
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Stephen Allen
>            Assignee: Stephen Allen
>            Priority: Minor
>         Attachments: PerformanceTester.java, jena-text benchmarks.png
>
>
> When executing a jena-text query, performance is terrible if the subject 
> is already bound to a variable.  This is because the current code executes 
> a new Lucene query, without the subject/entity bound, on every iteration 
> and then iterates through the Lucene results to join against the subject.  
> This is quite inefficient.
> Example query:
> {code}
> select *
> where {
>   ?s rdf:type <http://example.org/Entity> .
>   ?s text:query ( rdfs:label "test" ) .
> }
> {code}
> This would be quite slow if there were a lot of entities in the system.
> Two potential solutions present themselves:
> # Craft a more explicit Lucene query that specifies the entity URI, so that 
> the results coming back from Lucene are much smaller.  However, this would 
> cause problems with the score not being correct across multiple iterations.  
> Additionally we are still potentially running a lot of Lucene queries, each 
> of which probably has a non-negligible constant cost (parsing the query 
> string, etc).
> # Execute the more general Lucene query the first time it is encountered, 
> then cache the results somewhere.  From there, we can perform a hash 
> table lookup against those cached results.
> I would like to pursue option 2, but there is a problem.  Because jena-text 
> is implemented as a property function instead of a query op in and of itself 
> (like QueryIterMinus is for example), we have to find a place to stash the 
> Lucene results.  I believe this can be done by placing them in the 
> ExecutionContext object, using the Lucene query as a cache key.  Updates 
> provide a slightly troubling case because you could have an update request 
> like:
> {code}
> insert data { <urn:test1> rdf:type <http://example.org/Entity> ; rdfs:label "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label "test" ) ; ?p ?o . } ;
> insert data { <urn:test2> rdf:type <http://example.org/Entity> ; rdfs:label "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label "test" ) ; ?p ?o . }
> {code}
> The end result should be an empty database.  But if the ExecutionContext 
> were the same for both delete queries, the second delete query would use 
> the results cached by the first, and {{<urn:test2>}} would not be deleted.
> If the ExecutionContext is indeed shared between the two update queries in 
> the situation above, I think this can be solved by making the cache key for 
> the Lucene result set a combination of both the Lucene query and the 
> QueryIterRoot or BindingRoot.  I need to investigate this.  Alternatively, 
> if there were a way to be notified when a query has finished executing, we 
> could clear the cache in the ExecutionContext.
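
Option 2 from the description can be sketched with plain `java.util` collections rather than Jena or Lucene types. Everything here is hypothetical: `Hit`, `runGeneralQuery`, and the `executionId` that stands in for the QueryIterRoot/BindingRoot identity are assumptions for illustration, not Jena APIs. The general query runs once per (execution, query) pair, its hits are grouped by subject URI, and each bound subject then costs one hash lookup:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SubjectLookupSketch {
    // Hypothetical hit: the subject URI and the label text that matched.
    public static final class Hit {
        public final String subject;
        public final String label;
        Hit(String subject, String label) { this.subject = subject; this.label = label; }
    }

    // Stand-in for the text index: a flat list scanned by the "general" query.
    private final List<Hit> index = new ArrayList<>();
    // Cache: (executionId + query) -> (subject URI -> hits for that subject).
    private final Map<String, Map<String, List<Hit>>> cache = new HashMap<>();
    // Counts how often the stand-in index is actually queried.
    public int indexQueries = 0;

    public void add(String subject, String label) { index.add(new Hit(subject, label)); }

    // Hypothetical subject-unbound query; groups all matching hits by subject.
    private Map<String, List<Hit>> runGeneralQuery(String queryString) {
        indexQueries++;
        Map<String, List<Hit>> bySubject = new HashMap<>();
        for (Hit h : index) {
            if (h.label.contains(queryString)) {
                bySubject.computeIfAbsent(h.subject, s -> new ArrayList<>()).add(h);
            }
        }
        return bySubject;
    }

    // Per-binding lookup: one index query per cache key, then hash lookups.
    public List<Hit> hitsFor(long executionId, String queryString, String subject) {
        String cacheKey = executionId + " " + queryString;
        Map<String, List<Hit>> bySubject =
            cache.computeIfAbsent(cacheKey, k -> runGeneralQuery(queryString));
        return bySubject.getOrDefault(subject, Collections.emptyList());
    }

    public static void main(String[] args) {
        SubjectLookupSketch sketch = new SubjectLookupSketch();
        sketch.add("urn:test1", "test");
        // First operation: two bound subjects share a single index query.
        sketch.hitsFor(1L, "test", "urn:test1");
        sketch.hitsFor(1L, "test", "urn:other");
        // Data changes; the next operation uses a fresh execution id.
        sketch.add("urn:test2", "test");
        System.out.println(sketch.hitsFor(2L, "test", "urn:test2").size());
        System.out.println(sketch.indexQueries);
    }
}
```

Reusing execution id `1L` for the last lookup would return the stale cached result and miss `urn:test2`, which is the update problem described above. Keying on the execution root, or clearing the cache when a query finishes, are the two ways out that the description proposes.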



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
