Working on a social network project, I've recently realized that the way I
was fetching users' feeds was not scalable:
  FOR f IN 1 OUTBOUND 'users/<user-id>' follows
    FOR c, p IN 1 OUTBOUND f hasPublished
      SORT p.publishedAt DESC
      RETURN c
As the number of users and published content grows, this query consumes ever
more CPU and memory: it has to visit every followed user and every one of
their publications before it can sort, so with F follows and P publications
each it touches on the order of F x P documents per request.
So in order to optimize that, I've started to denormalize my data by using
an 'isSharedWith' edge collection that links users to published content and
has a skiplist index on a field named 'lastPublishedAt'.
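The edges are maintained at publish time, roughly like this (a minimal
sketch; the 'contents' collection name and the use of DATE_NOW() are just
placeholders for illustration):

  // sketch of the write path: fan the new content out to every follower
  FOR follower IN 1 INBOUND 'users/<author-id>' follows
    INSERT {
      _from: 'contents/<content-id>',   // the newly published document
      _to: follower._id,                // one edge per follower
      lastPublishedAt: DATE_NOW()       // the field the skiplist index covers
    } IN isSharedWith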
So now my query looks like:
  FOR c, s IN 1 INBOUND 'users/<user-id>' isSharedWith
    SORT s.lastPublishedAt DESC
    RETURN c
The "explanation" of this query is:
  Execution plan:
   Id   NodeType          Est.   Comment
    1   SingletonNode        1   * ROOT
    2   TraversalNode       84     - FOR c /* vertex */, s /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND 'users/283139442928' /* startnode */ isSharedWith
    3   CalculationNode     84     - LET #4 = s.`lastPublishedAt` /* attribute expression */
    4   SortNode            84     - SORT #4 DESC
    5   ReturnNode          84     - RETURN c

  Indexes used:
   By   Type   Collection     Unique   Sparse   Selectivity   Fields               Ranges
    2   edge   isSharedWith   false    false    1.18 %        [ `_from`, `_to` ]   base INBOUND

  Traversals on graphs:
   Id   Depth   Vertex collections   Edge collections   Options                                   Filter conditions
    2   1..1                         isSharedWith       uniqueVertices: none, uniqueEdges: path

  Optimization rules applied:
   none
But this still doesn't look right to me: the plan only uses the edge index
on [`_from`, `_to`], not the skiplist on lastPublishedAt, so a full traversal
is performed first to retrieve lastPublishedAt, and the sort on that field
then happens afterwards over the whole result set.
So my question is: is there a way to denormalize and query this kind of data
so that the complexity of the query (getting the x most recent elements)
doesn't grow with the total amount of data?
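In other words, the ideal would be for a query like the following (the LIMIT
value is just an example) to read the x most recent edges straight from the
index instead of materializing and sorting everything:

  FOR c, s IN 1 INBOUND 'users/<user-id>' isSharedWith
    SORT s.lastPublishedAt DESC
    LIMIT 10          // fetch only the x most recent items
    RETURN c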
Thanks in advance,
Thomas