[jira] [Commented] (JENA-1918) Bad performance when using ORDER BY

Rob Vesse (Jira) Wed, 17 Jun 2020 02:54:21 -0700


    [ https://issues.apache.org/jira/browse/JENA-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138299#comment-17138299 ]


Rob Vesse commented on JENA-1918:
---------------------------------

So your problem is that, in order to satisfy {{ORDER BY}}, Jena has to 
accumulate all possible results and sort them.  I think with that query Jena 
will try to use a Top N optimisation, where it stores at most N items and 
compares them as it accumulates possible solutions.  Here is the algebra for 
your simplified query:

{noformat}
> qparse --opt --file ~/Downloads/temp/JENA-1918.rq 
(prefix ((p: <http://www.wikidata.org/prop/>)
         (pq: <http://www.wikidata.org/prop/qualifier/>)
         (ps: <http://www.wikidata.org/prop/statement/>)
         (wdt: <http://www.wikidata.org/prop/direct/>)
         (wikibase: <http://wikiba.se/ontology#>)
         (wd: <http://www.wikidata.org/entity/>))
  (project (?item)
    (top (1 ?item)
      (sequence
        (bgp (triple ?item wdt:P31 ??P1))
        (path ??P1 (path* wdt:P279) wd:Q23397)))))
{noformat}

And for the more complex version:

{noformat}
(prefix ((p: <http://www.wikidata.org/prop/>)
         (pq: <http://www.wikidata.org/prop/qualifier/>)
         (ps: <http://www.wikidata.org/prop/statement/>)
         (wdt: <http://www.wikidata.org/prop/direct/>)
         (wikibase: <http://wikiba.se/ontology#>)
         (wd: <http://www.wikidata.org/entity/>))
  (project (?item ?outflow ?drainageBasin ?coordinates ?elevation ?country)
    (top (1 ?item)
      (conditional
        (conditional
          (conditional
            (conditional
              (conditional
                (sequence
                  (bgp (triple ?item wdt:P31 ??P1))
                  (path ??P1 (path* wdt:P279) wd:Q23397))
                (bgp (triple ?item wdt:P201 ?outflow)))
              (bgp (triple ?item wdt:P4614 ?drainageBasin)))
            (bgp (triple ?item wdt:P625 ?coordinates)))
          (bgp (triple ?item wdt:P2044 ?elevation)))
        (bgp (triple ?item wdt:P17 ?country))))))
{noformat}

So in both cases it is using the {{top}} operator to accumulate at most N 
items (in your case N=1), but this still requires it to evaluate every possible 
result, because the way Jena stores the data does not carry any ordering 
information.
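
To make the cost concrete, here is a minimal sketch of the general Top N technique in plain Java. It is not Jena's actual implementation; the class and method names ({{TopNSketch}}, {{topN}}) are made up for illustration, and solutions are modelled as plain strings. The point is that memory stays bounded by N while the input iterator is still consumed to the end.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a Top N operator; not Jena code.
final class TopNSketch {

    /** Returns the n smallest items under 'order'. Keeps at most n items in memory. */
    public static <T> List<T> topN(Iterator<T> solutions, int n, Comparator<T> order) {
        List<T> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Max-heap under the requested order: the head is the worst item currently kept.
        PriorityQueue<T> kept = new PriorityQueue<>(n, order.reversed());
        while (solutions.hasNext()) {          // every solution is still visited
            T candidate = solutions.next();
            if (kept.size() < n) {
                kept.add(candidate);
            } else if (order.compare(candidate, kept.peek()) < 0) {
                kept.poll();                   // drop the current worst
                kept.add(candidate);
            }
        }
        result.addAll(kept);
        result.sort(order);
        return result;
    }

    public static void main(String[] args) {
        Iterator<String> solutions = List.of("Q300", "Q100", "Q200").iterator();
        System.out.println(topN(solutions, 1, Comparator.<String>naturalOrder())); // prints [Q100]
        System.out.println(solutions.hasNext()); // prints false: the whole input was consumed
    }
}
```

With N=1 the heap never holds more than one item, but the loop cannot stop early, because any later solution might sort before the one currently kept.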

I suspect Blazegraph is doing some combination of the following to achieve the 
performance you see:

* Encoding all terms using a scheme that preserves the ordering of terms, 
allowing it to trivially find the N items needed to satisfy your query
* Applying a query caching layer so previously seen queries can be served from 
previously cached results
* Holding everything in-memory to remove any IO overheads (TDB uses memory-mapped 
files but is ultimately disk-backed)
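
As a rough illustration of why the first point would help, the sketch below walks an index that is already sorted by term and stops as soon as N matches are found. This is only an assumption about the general idea, not Blazegraph's actual design; {{OrderedIndexSketch}}, {{firstN}} and the {{matches}} predicate (standing in for "this item satisfies the graph pattern") are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical illustration of an order-preserving index; not Blazegraph or Jena code.
final class OrderedIndexSketch {

    /** Walks an already-sorted index and stops as soon as n matching terms are found. */
    public static List<String> firstN(TreeSet<String> sortedIndex, Predicate<String> matches, int n) {
        List<String> out = new ArrayList<>();
        if (n <= 0) {
            return out;
        }
        for (String term : sortedIndex) {      // TreeSet iterates in ascending order
            if (matches.test(term)) {
                out.add(term);
                if (out.size() == n) {
                    break;                     // early exit: the rest is never examined
                }
            }
        }
        return out;
    }
}
```

Because the terms arrive in sorted order, the first N matches are already the answer to {{ORDER BY ... LIMIT N}}, so no accumulation or sorting step is needed.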

> Bad performance when using ORDER BY
> -----------------------------------
>
>                 Key: JENA-1918
>                 URL: https://issues.apache.org/jira/browse/JENA-1918
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: Jena
>    Affects Versions: Jena 3.15.0
>            Reporter: Jonas Sourlier
>            Priority: Major
>
> I want to execute the following SPARQL against my local Apache Jena (with 
> preloaded Wikidata dump using TDB2):
> {code:java}
> PREFIX wd: <http://www.wikidata.org/entity/>
> PREFIX wdt: <http://www.wikidata.org/prop/direct/>
> PREFIX wikibase: <http://wikiba.se/ontology#>
> PREFIX p: <http://www.wikidata.org/prop/>
> PREFIX ps: <http://www.wikidata.org/prop/statement/>
> PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
> SELECT ?item ?outflow ?drainageBasin ?coordinates ?elevation ?country
>  
>  WHERE {
>  ?item wdt:P31/wdt:P279* wd:Q23397.
>  
>  OPTIONAL { ?item wdt:P201 ?outflow. }
>  OPTIONAL { ?item wdt:P4614 ?drainageBasin. }
>  OPTIONAL { ?item wdt:P625 ?coordinates. }
>  OPTIONAL { ?item wdt:P2044 ?elevation. }
>  OPTIONAL { ?item wdt:P17 ?country. }
>  }
>  
>  ORDER BY ?item LIMIT 1 OFFSET 0
> {code}
> When run on query.wikidata.org (which uses Blazegraph), this query takes 26 
> seconds to complete. Other queries run in about the same time as on 
> query.wikidata.org.
> Apache Jena runs for several hours, using one CPU core and 3-4 GB of memory. 
> Then it runs into some timeout (the timeout might be increased, but that's 
> not the issue here).
> My question is, why is this so much slower than Blazegraph? Can this SPARQL 
> be optimized to get a better performance? Can the query optimizer be tweaked 
> to run this more efficiently?
> If not, then I consider this a bug, because the query itself should not 
> generate such a big workload. If the query optimizer runs the
> {code:java}
> wdt:P31/wdt:P279*{code}
> predicate first, then limits it via the
> {code:java}
> ORDER BY ?item LIMIT 1 OFFSET 0{code}
> clause, there would be just one item for which it needs to execute the
> {code:java}
> OPTIONAL { ?item ... }{code}
> joins.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
