[
https://issues.apache.org/jira/browse/JENA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephen Allen updated JENA-999:
-------------------------------
Attachment: jena-text benchmarks.png
I've attached a chart (note log-log scale) of some performance benchmarks.
This is an in-memory TDB database with an in-memory lucene index. There are
10,000 entities ("test0" to "test9999") in the system. There are two
statements per entity:
{code}
<http://example.org/test1234> rdf:type <http://example.org/Entity> .
<http://example.org/test1234> rdfs:label "test1234" .
{code}
Jena-text indexes the rdfs:label predicate. The benchmarks show 5 different
implementations for the following query:
{code}
select *
where {
?s rdf:type <http://example.org/Entity> .
# lucene query string below varied to adjust the number of results
?s text:query ( rdfs:label "test1*" ) .
}
{code}
The 5 implementations are as follows:
# *Original* - the current jena-text implementation with a concrete subject
# *Lucene Index Join* - Option 1 described above. Added the subject as a
parameter to the lucene query
# *Hash Join* - Option 2 described above. Perform the lucene query for all
results, which are then cached in the ExecutionContext in a hashmap
# *Non-Indexed* - The query above is modified to remove the {{text:query}}
statement and instead simply retrieve the label from the RDF store and use
{{strstarts()}} to filter the results
# *Text Search First* - The current jena-text implementation, but the query
above is modified by switching the order of the two BGPs. This is the unbound
subject case.
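The hash join idea (option 2) can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the jena-text code: the class {{TextHashJoinSketch}}, the {{Hit}} type, and the {{Function}} standing in for the lucene index are all hypothetical names, and no Jena or Lucene APIs are used. The full result set is fetched once per query string, stored in a hash map keyed by subject URI, and then probed once per bound subject.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the hash join; none of these names are Jena or jena-text APIs. */
class TextHashJoinSketch {

    /** One text-index hit: the entity URI and its score. */
    static final class Hit {
        final String subject;
        final float score;
        Hit(String subject, float score) {
            this.subject = subject;
            this.score = score;
        }
    }

    // Cached result sets, keyed by the lucene query string (assumed cache key).
    private final Map<String, Map<String, Float>> cache = new HashMap<>();
    // Stand-in for the text index: maps a query string to all of its hits.
    private final Function<String, List<Hit>> index;
    // Counts how often the (expensive) index lookup actually runs.
    private int indexCalls = 0;

    TextHashJoinSketch(Function<String, List<Hit>> index) {
        this.index = index;
    }

    /** Returns the score for (query, subject), or null if the subject is not a hit. */
    Float probe(String query, String subject) {
        Map<String, Float> hits = cache.computeIfAbsent(query, q -> {
            indexCalls++;
            Map<String, Float> bySubject = new HashMap<>();
            for (Hit h : index.apply(q)) {
                bySubject.put(h.subject, h.score);
            }
            return bySubject;
        });
        return hits.get(subject);
    }

    /** Joins already-bound subjects against the cached hits, preserving input order. */
    List<String> join(String query, List<String> boundSubjects) {
        List<String> out = new ArrayList<>();
        for (String s : boundSubjects) {
            if (probe(query, s) != null) {
                out.add(s);
            }
        }
        return out;
    }

    int indexCalls() {
        return indexCalls;
    }
}
```

With this shape, joining N bound subjects costs one index query plus N constant-time map lookups, instead of N index queries that each scan the full result list.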
From these initial results, it seems worthwhile to try to implement a version
of option 2 (the hash join). This seems quite doable without the ugly hack of
storing stuff in the ExecutionContext by doing as you suggest and implementing
{{PropertyFunction}} ourselves instead of extending {{PropertyFunctionBase}}.
> Poor jena-text query performance when a bound subject is used
> -------------------------------------------------------------
>
> Key: JENA-999
> URL: https://issues.apache.org/jira/browse/JENA-999
> Project: Apache Jena
> Issue Type: Improvement
> Reporter: Stephen Allen
> Assignee: Stephen Allen
> Priority: Minor
> Attachments: jena-text benchmarks.png
>
>
> When executing a jena-text query, the performance is terrible if the subject
> is already bound to a variable. This is because the current code will
> execute a new lucene query that does not have the subject/entity bound on
> every iteration and then iterate through the lucene results to join against
> the subject. This is quite inefficient.
> Example query:
> {code}
> select *
> where {
> ?s rdf:type <http://example.org/Entity> .
> ?s text:query ( rdfs:label "test" ) .
> }
> {code}
> This would be quite slow if there were a lot of entities in the system.
> Two potential solutions present themselves:
> # Craft a more explicit lucene query that specifies the entity URI, so that
> the results coming back from lucene are much smaller. However, this would
> cause problems with the score not being correct across multiple iterations.
> Additionally we are still potentially running a lot of lucene queries, each
> of which has a probably non-negligible constant cost (parsing the query
> string, etc).
> # Execute the more general lucene query the first time it is encountered,
> then cache the results somewhere. From there, we can then perform a hash
> table lookup against those cached results.
> I would like to pursue option 2, but there is a problem. Because jena-text
> is implemented as a property function instead of a query op in and of itself
> (like QueryIterMinus is for example), we have to find a place to stash the
> lucene results. I believe this can be done by placing it in the
> ExecutionContext object, using the lucene query as a cache key. Updates
> provide a slightly troubling case because you could have an update request
> like:
> {code}
> insert data { <urn:test1> rdf:type <http://example.org/Entity> ; rdfs:label
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label
> "test" ) ; ?p ?o . } ;
> insert data { <urn:test2> rdf:type <http://example.org/Entity> ; rdfs:label
> "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label
> "test" ) ; ?p ?o . }
> {code}
> And then the end result should be an empty database. But if the
> ExecutionContext was the same for both delete queries, you would be using the
> cached results from the first delete query in the second delete query, which
> would result in {{<urn:test2>}} not being deleted properly.
> If the ExecutionContext is indeed shared between the two update queries in
> the situation above, I think this can be solved by making the cache key for
> the lucene resultset be a combination of both the lucene query and the
> QueryIterRoot or BindingRoot. I need to investigate this. Alternatively,
> if there were a way to be notified when a query has finished executing, we
> could clear the cache in the ExecutionContext.