The results are too large to keep in memory, so I would like to page
them using LIMIT and OFFSET. However, this does not work with the
query above: evaluating it requires all of the results to be loaded
into memory. I assume this is because more than one statement is
evaluated in the WHERE clause(?).
That's not why: it's because you're imposing an order with ORDER BY.
There are (broadly speaking) two ways this query could be executed.
If a store has an index on my:hasUserID (and that index happens to be
in SPARQL's defined order!) then results can be generated in ordered
sequence. Successive pages can be generated by re-running the query,
skipping more and more results, or somehow holding on to a cursor.
It's not enough to just skip userIDs: *rows* must be skipped, so the
query does have to be executed in order to skip to the right point.
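To make the mechanics concrete, here is a small sketch of what successive page queries look like. Only my:hasUserID comes from your query; the SELECT shape, the variable names, and the page size are assumptions:

```python
PAGE_SIZE = 1000  # assumed page size, not from the original query


def page_query(offset):
    # Build the paged form of the query. Each page re-runs the whole
    # query with a larger OFFSET; the store must still produce (or
    # skip over) every row before the offset to find the right point.
    return (
        "SELECT ?s ?id WHERE { ?s my:hasUserID ?id } "
        "ORDER BY ?id "
        f"LIMIT {PAGE_SIZE} OFFSET {offset}"
    )


print(page_query(0))     # first page
print(page_query(1000))  # second page: same query, larger OFFSET
```

Note that nothing in the second query lets the store reuse work from the first; with no index in ORDER BY order, each call sorts the full result set again.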
If a store does not have such an index, or your ORDER BY clause is
more complicated, then all the results must be gathered in memory to
be sorted. There's really no way around that.
For a store that doesn't maintain state between queries, generating
successive pages in this manner will essentially involve running the
whole query each time, returning a different chunk of the results. If
you have to sort 100,000 result rows in order to determine the first
1,000, then the second 1,000, you're going to see pretty poor
performance.
Each query execution will reflect any changes in the store since the
last page was generated, which can produce confusing results.
So, how could I page the above query?
Do it in your application. That way you also avoid the data changing
between pages.
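A minimal sketch of that approach, assuming the full result set can be materialized once (fetch_all_rows is a hypothetical stand-in for running the SPARQL query a single time, with ORDER BY but without LIMIT/OFFSET):

```python
def fetch_all_rows():
    # Hypothetical: pretend this runs the query once and returns rows
    # already sorted by userID.
    return [{"userID": i} for i in range(10)]


def pages(rows, page_size):
    """Yield successive fixed-size slices of an already-fetched result set."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


snapshot = fetch_all_rows()  # one query; later changes to the store
                             # cannot shift rows between pages
for page in pages(snapshot, 4):
    print([row["userID"] for row in page])
# prints [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]
```

The store does the sort exactly once, and every page is cut from the same snapshot, so a row can never appear on two pages or fall between them.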
I don't think that LIMIT and OFFSET are useful for supporting paging,
because the spec does not mandate sufficient efficiency constraints on
implementations (such as cursors, as provided by Freebase MQL
queries). It's odd to say "you could do it using the method the spec
recommends, but you'd be crazy to do so with real datasets". I
consider LIMIT's only real use to be for constraining the size of the
result set, not defining a page size.
IMO it would be much more useful to separate SPARQL execution into two
phases: a query that returns a result set, and then operations on that
result set (such as serializing slices of it). Conflating the two
places the burden of doing paging efficiently onto the implementation,
and there's no one good solution for all clients.
-R