Trying to answer my own question, I'm looking for a way to efficiently:
1. list all things shared with a user
2. sort those things by date
and I don't want the complexity of this query to grow with the amount of
data in the DB (it's a social network feed, so it will only ever grow). As
explained in the previous message, a naive approach (iterating over users,
then over their publications) consumes ever more CPU and memory.
My idea now is to denormalize the shares into a regular (non-edge)
collection where, for each share, I would store a document with a field of
the form {userId}_{time of the share}, and create a skiplist index on that
field.
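For illustration, such a share document could be created like this (a
sketch: the contentId field name is hypothetical, and I'm assuming an
ISO 8601 timestamp, which is convenient because it sorts lexicographically
in chronological order; the skiplist index on compoundField would be
created separately, e.g. with ensureIndex in arangosh):
insert {
  contentId: 'publications/<publication-id>',  /* hypothetical link to the shared content */
  compoundField: CONCAT('<user-id>', '_', DATE_ISO8601(DATE_NOW()))
} into shares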
The query would then be:
for s in shares
  filter s.compoundField >= 'userId_'        /* lower bound, so the skiplist can be used */
  sort s.compoundField desc                  /* sort on the indexed field */
  filter LIKE(s.compoundField, 'userId_%')   /* post-filter to this user's prefix */
  return s
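A variant that might be worth trying (a sketch, untested against the
optimizer) closes the range from above instead of post-filtering with LIKE,
and adds a limit since only the x most recent shares are needed:
for s in shares
  filter s.compoundField >= 'userId_'
  filter s.compoundField < 'userId`'   /* '`' is the character right after '_' in ASCII; worth verifying this upper bound against the server's string collation */
  sort s.compoundField desc
  limit 20   /* arbitrary example value for "the x most recent" */
  return s
With both bounds on the indexed field plus the limit, a descending range
scan over the skiplist should only ever touch the documents it returns; the
explain output will confirm whether the index is used for both the filter
and the sort.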
The explain output of the query looks good, as it seems to use the skiplist
directly both to target the userId and to sort. Your comments would be much
appreciated here!
Thanks,
Thomas
On Sunday, April 9, 2017 at 8:19:21 PM UTC+8, Thomas Weiss wrote:
>
> Working on a social network project, I've recently realized that the way I
> was fetching users' feeds was not scalable:
> for f in 1 outbound 'users/<user-id>' follows
>   for c, p in 1 outbound f hasPublished
>     sort p.publishedAt
>     return c
> As the number of users and published content grows, this query consumes
> ever more CPU and memory!
>
> So in order to optimize that, I've started to denormalize my data by using
> an 'isSharedWith' edge collection that links users to published content and
> has a skiplist index on a field named 'lastPublishedAt'.
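> For reference, an isSharedWith edge in this scheme looks roughly like this
> (a sketch; the edge direction is inferred from the inbound traversal below,
> and the timestamp value is only a placeholder):
> {
>   "_from": "publications/<publication-id>",
>   "_to": "users/<user-id>",
>   "lastPublishedAt": "<time of the latest share>"
> }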
> So now my query looks like:
> for c, s in 1 inbound 'users/<user-id>' isSharedWith
>   sort s.lastPublishedAt desc
>   return c
>
> The "explanation" of this query is:
> Execution plan:
>  Id   NodeType          Est.   Comment
>   1   SingletonNode        1   * ROOT
>   2   TraversalNode       84     - FOR c /* vertex */, s /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'users/283139442928' /* startnode */ isSharedWith
>   3   CalculationNode     84     - LET #4 = s.`lastPublishedAt` /* attribute expression */
>   4   SortNode            84     - SORT #4 DESC
>   5   ReturnNode          84     - RETURN c
>
>
> Indexes used:
>  By   Type   Collection     Unique   Sparse   Selectivity   Fields               Ranges
>   2   edge   isSharedWith   false    false    1.18 %        [ `_from`, `_to` ]   base INBOUND
>
>
> Traversals on graphs:
>  Id   Depth   Vertex collections   Edge collections   Options                                   Filter conditions
>   2   1..1                         isSharedWith       uniqueVertices: none, uniqueEdges: path
>
>
> Optimization rules applied:
> none
>
> But this still doesn't look good to me: it seems that a full traversal is
> first performed in order to retrieve lastPublishedAt, and the results are
> then sorted on that field; the skiplist on lastPublishedAt doesn't even
> appear under "Indexes used".
>
> So my question is: would there be a way to denormalize and query that kind
> of data so that the complexity of the query (getting the x most recent
> elements) doesn't grow with the amount of data?
>
> Thanks in advance,
> Thomas
>