[arangodb-google] Re: Performance question: AQL COLLECT

Manu Mon, 23 May 2016 13:43:10 -0700

Hi,

yes I have enough memory available. I came up with this query



let res = (for doc in test collect status = doc.status with count into g

return {status, g})


return {

allN:sum(for r in res return r.g)
, res:res
}

With the current devel build (f196428920a95b89ad21737446b9b18697e11182) the 
query execs in 1.3 secs with 2.9 M docs.
Just for fun I removed the skiplist index for status and the query execs 
1.5 secs. This one was unexspected.


Am Montag, 25. April 2016 16:41:21 UTC+2 schrieb Wilfried Gösgens:
>
> Hi, 
> yes, you're right. 
> The index doesn't have such statistics.
> As mentioned above, the COLLECT has to do string comparison for each item.
> By actualy providing it with the Buckets, it can lookup these directly via 
> the index.
> for this query:
>
> for status in [200,301,302,404,500,503]
> return length(for doc in users
> filter doc.status == status
> return doc)
>
> you could probably increase the speed once more by not returning the 
> document (return doc) but simply 1 (return 1) since you don't use the 
> actual document anyways.
> You would probably want to run it like that:
>
> for status in [200,301,302,404,500,503]
>   return {statusOf: status, count: length(
>       for doc in users filter doc.status == status return 1)}
>
> To understand the actualy performance difference, one should inspect the 
> results of explain; One reason could be thath '[status]' uses V8 to execute:
>
> db._explain('for status in [200,301,302,404,500,503] return {[status] : 
> length(for doc in test filter doc.status == status return 1)} ')
> Query string:
>  for status in [200,301,302,404,500,503] return {[status] : length(for doc 
> in test filter doc.status 
>  == status return 1)} 
>
> Execution plan:
>  Id   NodeType            Est.   Comment
>   1   SingletonNode          1   * ROOT
>   2   CalculationNode        1     - LET #4 = [ 200, 301, 302, 404, 500, 
> 503 ]   /* json expression */   /* const assignment */
>   3   EnumerateListNode      6     - FOR status IN #4   /* list iteration 
> */
>  10   SubqueryNode           6       - LET #2 = ...   /* subquery */
>   4   SingletonNode          1         * ROOT
>   8   CalculationNode        1           - LET #8 = 1   /* json expression 
> */   /* const assignment */
>  13   IndexNode              1           - FOR doc IN test   /* skiplist 
> index scan */
>   9   ReturnNode             1             - RETURN #8
>  11   CalculationNode        6       - LET #10 = { [ status ] : LENGTH(#2) 
> }   /* v8 expression */
>  12   ReturnNode             6       - RETURN #10
>
> Indexes used:
>  By   Type       Collection   Unique   Sparse   Selectivity   
> Fields         Ranges
>  13   skiplist   test         false    false            n/a   [ `status` 
> ]   (doc.`status` == status)
>
> However, at least on our linux boxes the difference isn't as huge as for 
> you. What host do you use? Does it have enough memory available?
>
> Cheers, 
> Willi
> On Thursday, April 21, 2016 at 1:01:10 PM UTC+2, Manu wrote:
>>
>> Hi,
>>
>> thanks for the explanation.
>>
>> Am I right in assuming that the skiplist index has no precomputed groups 
>> or stats about groups?
>>
>> When I do a AQL query 
>>
>> RETURN LENGTH(FOR doc IN docs
>> FILTER doc.status == N
>> RETURN doc)
>>
>> Then the sum for every runtime is about 700ms.
>>
>> More interesting about 2,5s difference between
>>
>> return { [status]: length(AQL query)} and return {[status]: (prev. AQL 
>> query result length)} :
>>
>> // query
>> // ~789ms
>> for status in [200,301,302,404,500,503]
>> return length(for doc in users
>> filter doc.status == status
>> return doc)
>>
>> // query
>> // ~3381ms
>> for status in [200,301,302,404,500,503]
>> return {[status]:length(for doc in users
>> filter doc.status == status
>> return doc)}
>>
>> this Query will also run ~700ms
>> // query
>> // ~789ms
>> let stati = [200,301,302,404,500,503]
>> let res = (for status in stati
>> return length(for doc in users
>> filter doc.status == status
>> return doc))
>> for status in stati
>> return { [status]: res[ POSITION(stati, status, true)   ]}
>>
>>
>> Am Donnerstag, 21. April 2016 10:21:42 UTC+2 schrieb Wilfried Gösgens:
>>>
>>> Hi, 
>>> lets have a look at what this query has to do behind the curtains:
>>>
>>> - it has to look at all documents
>>> - since the index is used, it gets values of one group at a time
>>> - it has to compare the last and the next value to detect a group change
>>>
>>> If it wouldn't have the index, it would have to do more compares to sort 
>>> the values into their respective buckets.
>>> Since your value distribution is not to big, even without the index you 
>>> get a fair chance to hit a next element belonging to the same group, so the 
>>> gains aren't that huge:
>>> Testing it at my machine made the collect in double time without index.
>>>
>>> With a broader distribution of the values you would probably gain more 
>>> with the index, but still the comparisons is whats expensive here, and I 
>>> think the 1.7s aren't bad for that.
>>>
>>> Cheers, 
>>> Willi
>>>
>>> On Wednesday, April 20, 2016 at 11:28:13 AM UTC+2, Manuel wrote:
>>>>
>>>> I have ~1M docs
>>>> Example:
>>>> { status:404}, {status:500}
>>>>
>>>> A skiplist index is applied at status
>>>>
>>>> when I do the following AQL query:
>>>>
>>>> FOR doc IN docs
>>>>
>>>> COLLECT status = doc.status WITH COUNT INTO g
>>>>
>>>> RETURN {[status]: g}
>>>>
>>>>
>>>> it takes around 1,7s to get the result.
>>>>
>>>> My understanding is that this AQL query just returns the skiplist stats.
>>>>
>>>> So could this query be optimized in this way?
>>>>
>>>> // some pseudocode to express my thought
>>>> result = {}
>>>> for(list in skiplist) {
>>>> result[list.name] = list.count;
>>>> }
>>>> return [result]
>>>>
>>>> kind regards
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

[arangodb-google] Re: Performance question: AQL COLLECT

Reply via email to