[jira] [Comment Edited] (JENA-999) Poor jena-text query performance when a bound subject is used

Andy Seaborne (JIRA) Thu, 06 Aug 2015 04:32:27 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659850#comment-14659850
 ]


Andy Seaborne edited comment on JENA-999 at 8/6/15 11:31 AM:
-------------------------------------------------------------

Deeply nested {{TextQueryPF}} may be called several times.  Property functions 
have to fit into the overall way that execution happens.  That may mean that 
some parts are not optimal for all property functions.  

{{build()}} is called by {{OpExecutor.execute(OpPropFunc, QueryIterator)}} once 
but, as with any execution, {{OpExecutor.execute(OpPropFunc)}} may be called 
repeatedly.  That is the way ARQ works.  {{build()}} should be quite cheap, or at 
least can be made so for common cases.  This allows property functions to be 
quite dynamic.  From your figures, it seemed that the main cost in 
{{TextQueryPF}} is the overhead of going to Lucene or Solr.

I thought the idea was to do the calculation in {{exec(QueryIterator)}}.  (NB: 
There can be several {{TextQueryPF}} in one query which share the Context.  The 
Context is considered immutable for an execution.  The ExecutionContext changes, 
and it may also be a different object in some parts of the query.)
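
As a rough illustration of keeping {{build()}} cheap and doing the work in the 
exec step, here is a minimal, self-contained Java sketch.  It does not use the 
ARQ API: {{SketchTextPF}}, its fields, and the {{Function}} standing in for the 
Lucene/Solr call are all hypothetical names, and the real 
{{exec(QueryIterator)}} works on bindings rather than plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in for a property function. This is NOT the ARQ interface.
class SketchTextPF {
    // Assumption: the text index is modelled as "query string -> matching subject URIs".
    private final Function<String, List<String>> index;
    private String queryString;
    int indexCalls = 0; // counts simulated round trips to Lucene/Solr

    SketchTextPF(Function<String, List<String>> index) {
        this.index = index;
    }

    // Cheap: validate and remember the arguments, never touch the index.
    void build(String queryString) {
        if (queryString == null || queryString.isEmpty()) {
            throw new IllegalArgumentException("empty text query");
        }
        this.queryString = queryString;
    }

    // One index call per exec(), then a set lookup per bound input subject.
    List<String> exec(List<String> boundSubjects) {
        indexCalls++;
        Set<String> hits = new HashSet<>(index.apply(queryString));
        List<String> out = new ArrayList<>();
        for (String subject : boundSubjects) {
            if (hits.contains(subject)) {
                out.add(subject);
            }
        }
        return out;
    }
}
```

With this shape, repeated calls to {{build()}} cost almost nothing, and each 
exec call pays for one index round trip however many input subjects arrive.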

I don't have time to investigate in great detail but it may be possible to call 
{{build()}} as part of transformation (cf. {{OpVisitorExprPrepare}} for 
filters).  Maybe in TransformPropertyFunction or a later Transform in the 
optimization pipeline.

All other property functions would need checking that this is safe, but it is 
quite likely that they are (from memory).  Not in TransformPropertyFunction 
itself, but as a similar Transform.




was (Author: andy.seaborne):
Example query?

{{build()}} is called by {{OpExecutor.execute(OpPropFunc, QueryIterator)}} once 
but as with any execution, {{OpExecutor.execute(OpPropFunc)}} may be called 
repeatedly.  That is the way ARQ works.

I thought the idea was to do the calculation in {{exec(QueryIterator)}}.  There 
can be several {{TextQueryPF}} in one query which share the Context.

{{build()}} should be quite cheap or at least can be made so for common cases.  
This allows for PF being quite dynamic.  From your figures, it seemed that the 
main cost in TextQueryPF is overhead going to Lucene or Solr.

I don't have time to investigate in great detail but it may be possible to call 
{{build()}} as part of transformation (cf. {{OpVisitorExprPrepare}} for 
filters).  Maybe in TransformPropertyFunction or a later Transform in the 
optimization pipeline.

All other property functions would need checking that this is safe, but it is 
quite likely that they are (from memory).  Not in TransformPropertyFunction 
itself, but as a similar Transform.

Property functions have to fit into the overall way that execution happens.  
That may mean that some parts are not optimal for all property functions.  




> Poor jena-text query performance when a bound subject is used
> -------------------------------------------------------------
>
>                 Key: JENA-999
>                 URL: https://issues.apache.org/jira/browse/JENA-999
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Stephen Allen
>            Assignee: Stephen Allen
>            Priority: Minor
>         Attachments: PerformanceTester.java, jena-text benchmarks.png
>
>
> When executing a jena-text query, the performance is terrible if the subject 
> variable is already bound.  This is because, on every iteration, the current 
> code executes a new lucene query that does not have the subject/entity bound 
> and then iterates through the lucene results to join against the subject.  
> This is quite inefficient.
> Example query:
> {code}
> select *
> where {
>   ?s rdf:type <http://example.org/Entity> .
>   ?s text:query ( rdfs:label "test" ) .
> }
> {code}
> This would be quite slow if there were a lot of entities in the system.
> Two potential solutions present themselves:
> # Craft a more explicit lucene query that specifies the entity URI, so that 
> the results coming back from lucene are much smaller.  However, this would 
> cause problems with the score not being correct across multiple iterations.  
> Additionally we are still potentially running a lot of lucene queries, each 
> of which probably has a non-negligible constant cost (parsing the query 
> string, etc).
> # Execute the more general lucene query the first time it is encountered, 
> then cache the results somewhere.  From there, we can then perform a hash 
> table lookup against those cached results.
> I would like to pursue option 2, but there is a problem.  Because jena-text 
> is implemented as a property function instead of a query op in and of itself 
> (like QueryIterMinus is for example), we have to find a place to stash the 
> lucene results.  I believe this can be done by placing it in the 
> ExecutionContext object, using the lucene query as a cache key.  Updates 
> provide a slightly troubling case because you could have an update request 
> like:
> {code}
> insert data { <urn:test1> rdf:type <http://example.org/Entity> ; rdfs:label 
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label 
> "test" ) ; ?p ?o . } ;
> insert data { <urn:test2> rdf:type <http://example.org/Entity> ; rdfs:label 
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label 
> "test" ) ; ?p ?o . }
> {code}
> And then the end result should be an empty database.  But if the 
> ExecutionContext was the same for both delete queries, you would be using the 
> cached results from the first delete query in the second delete query, which 
> would result in {{<urn:test2>}} not being deleted properly.
> If the ExecutionContext is indeed shared between the two update queries in 
> the situation above, I think this can be solved by making the cache key for 
> the lucene resultset be a combination of both the lucene query and the 
> QueryIterRoot or BindingRoot.  I need to investigate this.  Alternatively, 
> if there were a way to be notified when a query has finished executing, we 
> could clear the cache in the ExecutionContext.
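
The caching idea from option 2 in the description above, together with the 
proposed composite cache key, might be sketched roughly as follows.  This is 
only an illustration under stated assumptions: {{TextHitCache}}, {{Key}}, and 
{{matches}} are hypothetical names rather than jena-text or ARQ classes, the 
index is modelled as a plain function from query string to subject URIs, the 
execution root is any object compared by identity, and scores are left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical cache of text-index hits. Not part of jena-text or ARQ.
class TextHitCache {
    // Composite key: the lucene query string plus the root of the current execution.
    static final class Key {
        final String luceneQuery;
        final Object executionRoot; // assumption: one distinct object per query execution

        Key(String luceneQuery, Object executionRoot) {
            this.luceneQuery = luceneQuery;
            this.executionRoot = executionRoot;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Key)) {
                return false;
            }
            Key k = (Key) other;
            // The root is compared by identity so a new execution never sees old hits.
            return luceneQuery.equals(k.luceneQuery) && executionRoot == k.executionRoot;
        }

        @Override
        public int hashCode() {
            return 31 * luceneQuery.hashCode() + System.identityHashCode(executionRoot);
        }
    }

    private final Map<Key, Set<String>> cache = new HashMap<>();
    private final Function<String, List<String>> index; // stand-in for the lucene call
    int indexCalls = 0; // counts simulated lucene queries

    TextHitCache(Function<String, List<String>> index) {
        this.index = index;
    }

    // Runs the general query once per (query, root), then answers by hash lookup.
    boolean matches(String luceneQuery, Object executionRoot, String subjectUri) {
        Set<String> hits = cache.computeIfAbsent(new Key(luceneQuery, executionRoot), k -> {
            indexCalls++;
            return new HashSet<>(index.apply(k.luceneQuery));
        });
        return hits.contains(subjectUri);
    }

    // The alternative from the description: drop everything once a query has finished.
    void clear() {
        cache.clear();
    }
}
```

In the update example from the description, the second delete would run under 
a different root object, so it would trigger a fresh index query and see 
{{<urn:test2>}} instead of reusing the hits cached for the first delete.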



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
