Hi Hugh,
when traversing, the number of documents we have to walk through will
increase (typically exponentially) with every additional level of depth; with
an average of, say, 10 outgoing edges per vertex, depth 4 already means on
the order of 10^4 paths. So I would expect the runtime to increase for higher
depths. However, I agree with you that the time you measured seems to be
quite high, and I would like to help you with this.
First of all, I have tried your query and could see that for some reason the

  filter @domain in p.edges[0].server_name

is not optimized correctly. This seems to be an internal issue with the
optimization rule not being good enough; I will take a detailed look into
this and try to make sure that it works as expected.
Because of this, the traversal is not yet able to use a different index for
this case and will not correctly short-circuit to abort the search at
level 1. I am very sorry for the inconvenience, as the way you did it should
be the correct one.
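For reference, you can watch what the optimizer does from arangosh via
db._explain(), which prints the execution plan including the indexes the
optimizer picked (a minimal sketch; "example.com" is just a placeholder for
your real domain):

  // arangosh: print the execution plan for the traversal query,
  // including the indexes the optimizer has chosen
  db._explain(
    `for n in nginx
       for v, e, p in 0..4 outbound n forward, dispatch, route,
           INBOUND deployto, referto, monitoron
         filter @domain in p.edges[0].server_name
         return {id: v._id, type: v.ci_type}`,
    { domain: "example.com" }
  );

That is essentially what you already did with explain, so it is mostly useful
to re-check the plan once the optimizer fix is out.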
To have a quick workaround for now, you can split the first part of the query
into a separate step, so the FILTER runs on a one-step traversal and only the
matching neighbors seed the deeper traversal.

This is the fast version of my modified query (its result will not include
the nginx vertices themselves; for that, see the slower version below):
FOR n IN nginx
  FOR forwarded, e IN 1 OUTBOUND n forward
    FILTER @domain IN e.server_name
    /* At this point we only have the relevant first-depth vertices */
    FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route,
        INBOUND deployto, referto, monitoron
      RETURN {id: v._id, type: v.ci_type}
This is a slightly slower version of my modified query (it preserves your
output format, and I think it will still be faster than the one you are
working with):
FOR tmp IN (
  FOR n IN nginx
    FOR forwarded, e IN 1 OUTBOUND n forward
      FILTER @domain IN e.server_name
      /* At this point we only have the relevant first-depth vertices */
      RETURN APPEND([{id: n._id, type: n.ci_type}], (
        FOR v IN 0..3 OUTBOUND forwarded forward, dispatch, route,
            INBOUND deployto, referto, monitoron
          RETURN {id: v._id, type: v.ci_type}
      ))
)[**] /* the [**] flattens the per-nginx lists into one flat list */
  RETURN tmp
In addition, without having the data I can only give some general advice:
1. (This will work after we have fixed the optimizer.) Usage of the index:
ArangoDB uses statistics/assumptions about index selectivity (how well an
index narrows down the data) to decide which index is better. In your case it
may assume that the edge index is better than your hash index. You could try
to create a combined hash index on ["_from", "server_name[*]"], which is more
likely to get a better selectivity estimate than the edge index and could
then be used; see the sketch right below this item.
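In arangosh, creating such a combined array index could look like the
following (a minimal sketch; I am assuming the edge collection is literally
called "forward", as in your query):

  // arangosh: combined hash index on the edge's _from attribute and the
  // individual values inside the server_name array
  db.forward.ensureIndex({
    type: "hash",
    fields: [ "_from", "server_name[*]" ]
  });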
2. In the example you have given I can see that there is a "large" right part
starting at the apppkg node. In the query this right part can be reached in
two ways:
a) nginx -> tomcat <- apppkg
b) nginx -> varnish -> lvs -> tomcat <- apppkg
This means the query could walk through the subtree starting at apppkg
multiple times (once for every path leading there). With a query depth of 4
and only this topology that does not happen, but if there are shorter paths
this may become an issue as well. If I am not mistaken, you are only
interested in the distinct vertices in the graph and the path itself is not
important, right? If so, you can add OPTIONS to the query to make sure that
no vertex (and the subtree depending on it) is analysed twice. The modified
query would look like this:
FOR n IN nginx
  FOR v, e, p IN 0..4 OUTBOUND n forward, dispatch, route,
      INBOUND deployto, referto, monitoron
    OPTIONS {bfs: true, uniqueVertices: "global"}
    FILTER @domain IN p.edges[0].server_name
    RETURN {id: v._id, type: v.ci_type}
The change I made is adding OPTIONS to the traversal:

bfs: true => means we do a breadth-first search instead of a depth-first
search. We only need this to make the result deterministic and to make sure
that all vertices within a path depth of 4 are reached correctly.

uniqueVertices: "global" => means that whenever a vertex is found in one
traversal (so in your case, for every nginx separately), it is flagged and
will not be looked at again.

If you need the list of all distinct edges as well, you should use
`uniqueEdges: "global"` instead of `uniqueVertices: "global"`, which performs
the uniqueness check on the edge level; a sketch follows below.
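For illustration, the edge-level variant could look like this (a minimal
sketch; note that e is null for the depth-0 step, i.e. for the nginx vertex
itself):

  FOR n IN nginx
    FOR v, e, p IN 0..4 OUTBOUND n forward, dispatch, route,
        INBOUND deployto, referto, monitoron
      OPTIONS {bfs: true, uniqueEdges: "global"}
      FILTER @domain IN p.edges[0].server_name
      /* e is the edge used to reach v and is null at depth 0 */
      RETURN {vertex: v._id, edge: e._id}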
I can probably give you some more detailed advice if I could get an example
dataset (anonymized) for your use case.
Especially the usage of the index depends on estimates, and it may be that
the server assumes the default edge index to be better than the combined
hash index, but I can only validate that if I have some data.
If you do not want to share the data publicly but can share it privately
with us, please send it to [email protected].
We are also willing to sign an NDA if required.
Best,
Michael
> Am 18.01.2017 um 04:44 schrieb Hugh Chen <[email protected]>:
>
> I've been working on a config management system using ArangoDB which
> collects config data for some common software and streams it to a program
> which generates the relationships among those pieces of software based on
> some pre-defined rules and then saves the relations into ArangoDB. After
> the relations are established, I provide APIs to query the data. One
> important query is to generate the topology of these pieces of software. I
> use a graph traversal to generate the topology with the following AQL:
>
> for n in nginx
> for v,e,p in 0..4 outbound n forward, dispatch, route, INBOUND deployto,
> referto, monitoron
> filter @domain in p.edges[0].server_name
> return {id: v._id, type: v.ci_type}
>
> which can generate the following topology:
>
> [Image: generated-topology.png]
> <https://lh3.googleusercontent.com/-Qb39JcyO_HM/WH7hwJeTRUI/AAAAAAAAAG8/v7pDHlhmPGwATo5HIoXyB56ri1S4Y9daQCLcB/s1600/generated-topology.png>
>
> Which looks fine. However, it takes around 10 seconds to finish the query,
> which is not acceptable because the data volume is not very large. I
> checked all the collections, and the largest collection, the "forward" edge
> collection, only has around 28000 documents. So I did some tests:
> - I changed the depth from 0..4 to 0..2 and it only takes 0.3 seconds to
>   finish the query
> - I changed the depth from 0..4 to 0..3, it takes around 3 seconds
> - for 0..4, it takes around 10 seconds
> Since there is a server_name property on the "forward" edges, I added a
> hash index on server_name[*], but from the explain execution plan it seems
> ArangoDB doesn't use the index.
> Any tips on how I can optimize the query? And why can't the index be used
> in this case?
>
> Hope someone can help me out with this. Thanks in advance,