*tldr*

My project involves Wikipedia's pagelinks dataset. When imported into 
Neo4j, this results in a large directed graph with ~11m nodes and ~172m 
relationships. I want to efficiently find the shortest path between any two 
nodes in the graph. With my current query, and after tweaking Java's memory 
settings, it takes ~60 seconds to return a path. I would like feedback on 
how to reduce this response time.

*details*

My *setup* is a MacBook Air (1.3 GHz Intel Core i5, 4 GB 1600 MHz DDR3, 
OS X 10.9.5) with Neo4j (v. 2.1.5) and Java (v. 1.7.0_71) installed. 

Here's my GitHub repo <https://github.com/erabug/wikigraph> (the README 
contains more details on the methods described below).

I successfully batch imported my nodes.csv and rels.csv files into Neo4j. 
As I mentioned above, this produces a graph with ~11m nodes and ~172m 
relationships. 

The *data model* is simple: every Wikipedia page is a node with an id and a 
title (properties 'node' and 'name') and at least one label. All nodes are 
labeled 'Page', and some also carry a more specific category label, e.g. 
'OfficeHolder'. There is only one relationship type, 'LINKS_TO', which 
records the pages a given page links to. 

graph structure: (Page) -[:LINKS_TO]-> (Page)
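
To make the task concrete, here is a minimal in-memory sketch of the same 
problem on a toy graph. The page names and the shortest_path helper are 
made up for illustration, and this is a plain breadth-first search, not a 
description of how Neo4j implements shortestPath internally.

```python
from collections import deque

def shortest_path(links, start, goal):
    """Breadth-first search over a dict mapping page -> list of linked pages."""
    if start == goal:
        return [start]
    parents = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target in parents:
                continue
            parents[target] = page
            if target == goal:
                # Walk the parent pointers back to the start page.
                path = [target]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(target)
    return None                      # no directed path exists

# Toy stand-in for (Page)-[:LINKS_TO]->(Page); names are hypothetical.
toy_links = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_path(toy_links, "A", "E"))
```

Because the links are directed, a path from "A" to "E" does not imply one 
from "E" to "A".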

Here is the *query* I use, via py2neo (v. 1.6.4) CypherQuery object:

query = neo4j.CypherQuery(
    graph_db,
    """MATCH (m {node:'%s'}), (n {node:'%s'}),
    p = shortestPath((m)-[*..20]->(n))
    RETURN p""" % (node1, node2)
)
path = query.execute_one()
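
For comparison, here is a sketch of a variant of this query that adds the 
'Page' label and the 'LINKS_TO' type to the pattern and passes the ids as 
Cypher parameters instead of formatting them into the string. It is 
untested against this dataset. It assumes a schema index on :Page(node) 
has been created (not part of the setup described here) and that py2neo 
1.6's execute_one accepts parameters as keyword arguments. build_query, 
start_id and end_id are hypothetical names.

```python
def build_query(max_hops=20):
    """Return a parameterized shortest-path query in Neo4j 2.x Cypher syntax."""
    # {start_id} and {end_id} are Cypher parameters filled in by the server,
    # so the ids are never spliced into the query text.
    template = (
        "MATCH (m:Page {node: {start_id}}), (n:Page {node: {end_id}}), "
        "p = shortestPath((m)-[:LINKS_TO*..%d]->(n)) "
        "RETURN p"
    )
    # Only the hop limit is formatted in, and it is forced to an integer.
    return template % int(max_hops)

# Hypothetical usage with py2neo 1.6 (needs a running server, so commented out):
# from py2neo import neo4j
# graph_db = neo4j.GraphDatabaseService()
# query = neo4j.CypherQuery(graph_db, build_query())
# path = query.execute_one(start_id=node1, end_id=node2)
```

Whether the label and index change the lookup cost for the two endpoints 
here is exactly the kind of thing I would like feedback on.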

Auto-indexing (on 'node', i.e. the id number) is turned on. Increasing 
*java.initmemory* and *java.maxmemory* had a dramatic effect on response 
time. At the default setting for both (512MB), the shortest path was 
returned in ~27 minutes. At any setting higher than 4G (currently using 
8192MB), the path is returned in ~60 seconds. I also tweaked settings in 
neo4j.properties, but saw no noticeable decrease in response time.

*logs*

messages.log <https://gist.github.com/erabug/e2e683fbeae124804370>

*what I've found from googling*

I have read "Neo4j Cypher path finding slow in undirected graph" 
<http://stackoverflow.com/questions/15456345/neo4j-cypher-path-finding-slow-in-undirected-graph>,
"Tuning neo4j for performance" 
<http://stackoverflow.com/questions/17661902/tuning-neo4j-for-performance>, 
and Neo4j's Performance Guide 
<http://neo4j.com/docs/stable/performance-guide.html>. However, I'm not 
sure I know enough Java to try some of the suggestions on my own. If that's 
what is required to improve response time, I'm happy to learn, but I 
wanted to make sure it was the right approach first.

*server experiment*

I also deployed to an Amazon EC2 instance (t2.micro, 1 GB memory, 1 vCPU, 
Ubuntu), just to experiment. I tried to change the same Neo4j Java settings 
there as I had on my local machine, but I could not run the Neo4j server 
with anything other than the defaults. As a result, the query there takes 
~22 minutes.

*feedback*

I would love advice about the query, my settings, and things to try on the 
server (where I will ultimately want to house my project). Please let me 
know if I can provide any further information or clarification.
