Hi,
yes I have enough memory available. I came up with this query
let res = (for doc in test collect status = doc.status with count into g
return {status, g})
return {
allN:sum(for r in res return r.g)
, res:res
}
With the current devel build (f196428920a95b89ad21737446b9b18697e11182) the
query execs in 1.3 secs with 2.9 M docs.
Just for fun I removed the skiplist index for status and the query execs
1.5 secs. This one was unexspected.
Am Montag, 25. April 2016 16:41:21 UTC+2 schrieb Wilfried Gösgens:
>
> Hi,
> yes, you're right.
> The index doesn't have such statistics.
> As mentioned above, the COLLECT has to do string comparison for each item.
> By actualy providing it with the Buckets, it can lookup these directly via
> the index.
> for this query:
>
> for status in [200,301,302,404,500,503]
> return length(for doc in users
> filter doc.status == status
> return doc)
>
> you could probably increase the speed once more by not returning the
> document (return doc) but simply 1 (return 1) since you don't use the
> actual document anyways.
> You would probably want to run it like that:
>
> for status in [200,301,302,404,500,503]
> return {statusOf: status, count: length(
> for doc in users filter doc.status == status return 1)}
>
> To understand the actualy performance difference, one should inspect the
> results of explain; One reason could be thath '[status]' uses V8 to execute:
>
> db._explain('for status in [200,301,302,404,500,503] return {[status] :
> length(for doc in test filter doc.status == status return 1)} ')
> Query string:
> for status in [200,301,302,404,500,503] return {[status] : length(for doc
> in test filter doc.status
> == status return 1)}
>
> Execution plan:
> Id NodeType Est. Comment
> 1 SingletonNode 1 * ROOT
> 2 CalculationNode 1 - LET #4 = [ 200, 301, 302, 404, 500,
> 503 ] /* json expression */ /* const assignment */
> 3 EnumerateListNode 6 - FOR status IN #4 /* list iteration
> */
> 10 SubqueryNode 6 - LET #2 = ... /* subquery */
> 4 SingletonNode 1 * ROOT
> 8 CalculationNode 1 - LET #8 = 1 /* json expression
> */ /* const assignment */
> 13 IndexNode 1 - FOR doc IN test /* skiplist
> index scan */
> 9 ReturnNode 1 - RETURN #8
> 11 CalculationNode 6 - LET #10 = { [ status ] : LENGTH(#2)
> } /* v8 expression */
> 12 ReturnNode 6 - RETURN #10
>
> Indexes used:
> By Type Collection Unique Sparse Selectivity
> Fields Ranges
> 13 skiplist test false false n/a [ `status`
> ] (doc.`status` == status)
>
> However, at least on our linux boxes the difference isn't as huge as for
> you. What host do you use? Does it have enough memory available?
>
> Cheers,
> Willi
> On Thursday, April 21, 2016 at 1:01:10 PM UTC+2, Manu wrote:
>>
>> Hi,
>>
>> thanks for the explanation.
>>
>> Am I right in assuming that the skiplist index has no precomputed groups
>> or stats about groups?
>>
>> When I do a AQL query
>>
>> RETURN LENGTH(FOR doc IN docs
>> FILTER doc.status == N
>> RETURN doc)
>>
>> Then the sum for every runtime is about 700ms.
>>
>> More interesting about 2,5s difference between
>>
>> return { [status]: length(AQL query)} and return {[status]: (prev. AQL
>> query result length)} :
>>
>> // query
>> // ~789ms
>> for status in [200,301,302,404,500,503]
>> return length(for doc in users
>> filter doc.status == status
>> return doc)
>>
>> // query
>> // ~3381ms
>> for status in [200,301,302,404,500,503]
>> return {[status]:length(for doc in users
>> filter doc.status == status
>> return doc)}
>>
>> this Query will also run ~700ms
>> // query
>> // ~789ms
>> let stati = [200,301,302,404,500,503]
>> let res = (for status in stati
>> return length(for doc in users
>> filter doc.status == status
>> return doc))
>> for status in stati
>> return { [status]: res[ POSITION(stati, status, true) ]}
>>
>>
>> Am Donnerstag, 21. April 2016 10:21:42 UTC+2 schrieb Wilfried Gösgens:
>>>
>>> Hi,
>>> lets have a look at what this query has to do behind the curtains:
>>>
>>> - it has to look at all documents
>>> - since the index is used, it gets values of one group at a time
>>> - it has to compare the last and the next value to detect a group change
>>>
>>> If it wouldn't have the index, it would have to do more compares to sort
>>> the values into their respective buckets.
>>> Since your value distribution is not to big, even without the index you
>>> get a fair chance to hit a next element belonging to the same group, so the
>>> gains aren't that huge:
>>> Testing it at my machine made the collect in double time without index.
>>>
>>> With a broader distribution of the values you would probably gain more
>>> with the index, but still the comparisons is whats expensive here, and I
>>> think the 1.7s aren't bad for that.
>>>
>>> Cheers,
>>> Willi
>>>
>>> On Wednesday, April 20, 2016 at 11:28:13 AM UTC+2, Manuel wrote:
>>>>
>>>> I have ~1M docs
>>>> Example:
>>>> { status:404}, {status:500}
>>>>
>>>> A skiplist index is applied at status
>>>>
>>>> when I do the following AQL query:
>>>>
>>>> FOR doc IN docs
>>>>
>>>> COLLECT status = doc.status WITH COUNT INTO g
>>>>
>>>> RETURN {[status]: g}
>>>>
>>>>
>>>> it takes around 1,7s to get the result.
>>>>
>>>> My understanding is that this AQL query just returns the skiplist stats.
>>>>
>>>> So could this query be optimized in this way?
>>>>
>>>> // some pseudocode to express my thought
>>>> result = {}
>>>> for(list in skiplist) {
>>>> result[list.name] = list.count;
>>>> }
>>>> return [result]
>>>>
>>>> kind regards
>>>>
>>>
--
You received this message because you are subscribed to the Google Groups
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.