Re: Poor performance with OFFSET

Rob Vesse Tue, 06 Aug 2013 09:37:50 -0700

No you can't create a index that pre-computes the sorting of some
particular facet of your data.


Btw if you are using TDB then there shouldn't be any need to use an ORDER
BY to get paging since TDB will for any given query run multiple times
always return the data in a predictable fashion due to how it does index
scans and reads.  Obviously if you want a specific ordering then you need
to add an ORDER BY

Rob



On 8/6/13 8:10 AM, "Olivier Rossel" <olivier.ros...@gmail.com> wrote:

>Using LIMIT/OFFSET sounds like paging, to me.
>And a usual requirement of paging is to order data before paging.
>Ordering can be for example "order alphabetically mmx:Objects by their
>label".
>So the question is: can I create an index that precomputes "the sorting of
>labels of mmx:Objects".
>
>Andy s answer is pretty interesting (and more general).
>Tristan, the link you provide sounds interesting too. But I am not sure
>how
>it helps for the creation of the index I described above.
>
>
>
>
>On Tue, Aug 6, 2013 at 4:59 PM, Tristan Roddis <trist...@cogapp.com>
>wrote:
>
>> On 05/08/2013 21:02, Olivier Rossel wrote:
>>
>>> Please note that such a query does not sort mmx:Objects in the inner
>>> select.
>>> I have no idea of the performances drop if you add that requirement.
>>>
>> Using my data: quite a lot. For a sample offset of 50000, it increases
>> from 380ms without, to 1600ms with ordering.
>>
>>  BTW, is it possible to create an index on (for example) "rdfs:label of
>>> mmx:Objects"?
>>>
>> I'm pretty new to Jena and I'm afraid I'm not sure what you mean! Do you
>> mean adding a full-text search index as described at
>> 
>>http://jena.apache.org/**documentation/query/text-**query.html<http://jen
>>a.apache.org/documentation/query/text-query.html>or something else?
>>
>> Thanks,
>>
>> -Tristan.
>>
>>
>>  On Mon, Aug 5, 2013 at 6:40 PM, Rob Vesse <rve...@yarcdata.com> wrote:
>>
>>> Note that OFFSET is implemented internally by having to iterate over
>>>all
>>>> the unnecessary solutions first.
>>>>
>>>> So if you have OFFSET 10000 then the query engine has to generate and
>>>> discard 10,000 possible solutions before it starts returning actual
>>>> solutions.  As OFFSET increases the time this takes increases.
>>>>
>>>> There is some degree of optimization for these kinds of queries in
>>>>ARQs
>>>> optimizer but it doesn't apply in all cases.
>>>>
>>>> Rob
>>>>
>>>>
>>>>
>>>> On 8/5/13 9:21 AM, "Tristan Roddis" <trist...@cogapp.com> wrote:
>>>>
>>>>  Thanks, Olivier: that did the trick!
>>>>>
>>>>> I tweaked it slightly as not all objects have a rdfs:label populated,
>>>>> and I need to keep that restriction. But using this query the maximum
>>>>> execution time drops to around 700ms:
>>>>>
>>>>> SELECT DISTINCT *  WHERE {
>>>>> { SELECT ?object ?label ?accessionNumber WHERE {?object a mmx:Object
>>>>>.
>>>>> ?object  rdfs:label ?label . ?object mmx:accessionNumber
>>>>> ?accessionNumber } LIMIT 12 OFFSET 20000 } .
>>>>> OPTIONAL {
>>>>> ?object mmx:preferredAsset ?asset .
>>>>> ?asset a mmx:Asset .
>>>>> ?asset dcterms:hasVersion ?image .
>>>>> ?image mmx:role 'thumbnail' .
>>>>> }
>>>>> }
>>>>>
>>>>> I also now realise that the lack of rdfs:label (and hence results)
>>>>>was
>>>>> the reason for the plateau. If I use your query syntax instead, I no
>>>>> longer see a plateau at 4000s for invalidly high offsets, but rather
>>>>>a
>>>>> speedy (zero result count) response in around 100ms.
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> -Tristan.
>>>>>
>>>>> On 05/08/2013 15:00, Olivier Rossel wrote:
>>>>>
>>>>>> Could you please try a query like this:
>>>>>> SELECT DISTINCT *  WHERE {
>>>>>> { SELECT ?object WHERE {?object a mmx:Object } LIMIT 12 OFFSET
>>>>>>10000 }
>>>>>> .
>>>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>>>> ?object  rdfs:label ?label .
>>>>>> OPTIONAL {
>>>>>> ?object mmx:preferredAsset ?asset .
>>>>>> ?asset a mmx:Asset .
>>>>>> ?asset dcterms:hasVersion ?image .
>>>>>> ?image mmx:role 'thumbnail' .
>>>>>> }
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 5, 2013 at 3:47 PM, Tristan Roddis <trist...@cogapp.com>
>>>>>> wrote:
>>>>>>
>>>>>>  We have a data set with around 160,000 triples corresponding to an
>>>>>>> "mmx:Object", and I am executing the following query to select data
>>>>>>> from
>>>>>>> these at various offsets:
>>>>>>>
>>>>>>> SELECT DISTINCT *  WHERE {
>>>>>>> ?object a mmx:Object .
>>>>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>>>>> ?object  rdfs:label ?label .
>>>>>>> OPTIONAL {
>>>>>>> ?object mmx:preferredAsset ?asset .
>>>>>>> ?asset a mmx:Asset .
>>>>>>> ?asset dcterms:hasVersion ?image .
>>>>>>> ?image mmx:role 'thumbnail' .
>>>>>>> }
>>>>>>> } LIMIT 12 OFFSET 10000
>>>>>>>
>>>>>>> However, what I have found is that as I increase the value of this
>>>>>>> offset,
>>>>>>> the execution time goes up dramatically: from tens of milliseconds
>>>>>>> when the
>>>>>>> offset is around 1000, to over 4,000 milliseconds if the offset is
>>>>>>> increased to 150,000. (see sample data below, or a graph at
>>>>>>>
>>>>>>> 
>>>>>>>http://picpaste.com/jena_****offsets.png<http://picpaste.com/jena_**
>>>>>>>offsets.png>
>>>>>>> 
>>>>>>><http://picpaste.**com/jena_offsets<http://picpaste.com/jena_offsets
>>>>>>>>
>>>>>>>
>>>>>> .
>>>>
>>>>> png>)
>>>>>>>
>>>>>>> I tried generating a stats.opt file, and also using none.opt but
>>>>>>>this
>>>>>>> behaviour seems to happen regardless: the execution time becomes
>>>>>>> extremely
>>>>>>> high very quickly, and then plateaus at offsets greater than
>>>>>>>100,000.
>>>>>>>
>>>>>>> Does anyone know any way to get better performance from Jena for
>>>>>>> offset
>>>>>>> queries such as this?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> -Tristan.
>>>>>>>
>>>>>>> Sample data below. Results obtained using Apache Benchmark on
>>>>>>>another
>>>>>>> machine on the same network segment. Dev server is a Linux machine
>>>>>>> with 3Gb
>>>>>>> total RAM, of which 2Gb allocated to the JVM. Production server is
>>>>>>>a
>>>>>>> quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to
>>>>>>>the
>>>>>>> JVM.
>>>>>>>
>>>>>>> Offset  Dev (ms)
>>>>>>>           Prod (ms)
>>>>>>> 10      8       11
>>>>>>> 100     15      17
>>>>>>> 1000    84      90
>>>>>>> 5000    260     460
>>>>>>> 10000   463     701
>>>>>>> 20000   811     1070
>>>>>>> 30000   1402    1502
>>>>>>> 40000   1844    2122
>>>>>>> 50000   2515    2695
>>>>>>> 60000   2960    3290
>>>>>>> 80000   3686    4190
>>>>>>> 100000  3840    4470
>>>>>>> 120000  3991    4416
>>>>>>> 150000  3865    4535
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> cogapp
>>>>>>>
>>>>>>> Tristan Roddis
>>>>>>> Head of Web Development
>>>>>>> +44 1273 821 600
>>>>>>> www.cogapp.com <http://www.cogapp.com>
>>>>>>>
>>>>>>>
>>>>>>> *News*
>>>>>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>>>>>> Digital programme led by a core group of UK charities, design by
>>>>>>> Lavish,
>>>>>>> technical architecture and implementation by Cogapp.
>>>>>>> For regular Cogapp news, please follow us on Twitter <
>>>>>>> https://twitter.com/cogapp>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> --
>>>>> cogapp
>>>>>
>>>>> Tristan Roddis
>>>>> Head of Web Development
>>>>> +44 1273 821 600
>>>>> www.cogapp.com <http://www.cogapp.com>
>>>>>
>>>>>
>>>>> *News*
>>>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>>>> Digital programme led by a core group of UK charities, design by
>>>>>Lavish,
>>>>> technical architecture and implementation by Cogapp.
>>>>> For regular Cogapp news, please follow us on Twitter
>>>>> <https://twitter.com/cogapp>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>> --
>> cogapp
>>
>> Tristan Roddis
>> Head of Web Development
>> +44 1273 821 600
>> www.cogapp.com <http://www.cogapp.com>
>>
>>
>> *News*
>> EnoughFoodIf.org <http://enoughfoodif.org>
>> Digital programme led by a core group of UK charities, design by Lavish,
>> technical architecture and implementation by Cogapp.
>> For regular Cogapp news, please follow us on Twitter <
>> https://twitter.com/cogapp>
>>
>>
>>
>>
>>

Re: Poor performance with OFFSET

Reply via email to