What is your actual write load?
How big was your batch size? Currently, for 2.0, 1000 elements per batch is
sensible. It will change back to 30-50k for Neo4j 2.1.
#0 Use parameters:
> MERGE (user:User { name: {user_name} })
> MERGE (tweet:Tweet { tweet_id: {tweet_id} })
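You can also MERGE the relationship itself and only bump a weight when it
already exists. Just a sketch; the REPLIED_TO type and the weight property are
assumptions based on your description:

  MERGE (a:User { name: {user_a} })
  MERGE (b:User { name: {user_b} })
  MERGE (a)-[r:REPLIED_TO]->(b)
    ON CREATE SET r.weight = 1
    ON MATCH SET r.weight = r.weight + 1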
#1 Can you share your server config / memory / disk, etc.? (best to share your
data/graph.db/messages.log)
#2 Make sure your driver uses the new transactional endpoint and streams data
back and forth
Usually you can insert 5-10k nodes per second in 2.0 with MERGE and parameters
in batched transactions (tx size of ~1k).
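Roughly like this, a Python sketch posting batches of parameterized MERGE
statements to the transactional endpoint with the requests library; the URL,
batch size, and helper names are assumptions to adapt to your setup:

# Sketch: send batches of parameterized MERGE statements, one transaction per
# POST, via Neo4j 2.0's transactional HTTP endpoint (/db/data/transaction/commit).
import json
import requests

TX_URL = "http://localhost:7474/db/data/transaction/commit"  # adjust to your server
BATCH_SIZE = 1000  # ~1k statements per transaction, as noted above

MERGE_USER = "MERGE (u:User { name: {user_name} })"

def flush(statements):
    # One POST == one transaction containing all statements in the batch.
    if not statements:
        return
    resp = requests.post(
        TX_URL,
        data=json.dumps({"statements": statements}),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        stream=True,  # stream the response body instead of buffering it all
    )
    resp.raise_for_status()
    errors = resp.json().get("errors", [])
    if errors:
        raise RuntimeError(errors)

def import_users(names):
    batch = []
    for name in names:
        batch.append({"statement": MERGE_USER, "parameters": {"user_name": name}})
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []
    flush(batch)  # remaining partial batch

import_users(["tom", "anna", "bob"])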
On 01.02.2014, at 17:51, Yun Wang <[email protected]> wrote:
> Question background
> We are building a graph (database) for Twitter users and tweets (batched
> updates for new data).
> We store each user and each tweet as a graph node.
> We store as graph edges: tweet-tweet relationships and user-user
> relationships (derived from users who retweet or reply to others).
>
> Problem: Updating the graph is very slow / not scalable
>
> Goal: Scalable / efficient updates of the existing Neo4j graph as new tweets
> come in (each tweet translates into nodes and edges). Constraint: if a node
> (e.g., a user) already exists, we do not want to duplicate it. Similarly, if
> an edge (user-user relationship) exists, we only want to update the edge
> weight.
>
> What we have tried:
> Option 1: We tried using Cypher's 'MERGE' clause to insert uniquely. We also
> executed Cypher queries in batches in order to reduce REST latency.
>
> Sample Cypher query used to update database:
> MERGE (user:User { name: 'tom' })
> MERGE (tweet:Tweet { tweet_id: '101' })
>
> We created an index on node properties such as 'name' of the User label and
> 'tweet_id' of the Tweet label.
> We increased the 'open file descriptors' limit in Linux to gain better
> performance.
>
> Problems with Option 1:
> The performance of checking uniqueness with the 'MERGE' clause dropped
> dramatically with scale / over time. For example, it took 2.7 seconds to
> insert 100 records when the database was empty, but 62 seconds to insert the
> same amount of data with 100,000 existing records.
>
> Option 2: The other option we have tried is to check uniqueness externally.
> That is, we take all nodes and edges and build a hash table outside Neo4j
> (e.g., in Python or Java) to check uniqueness. This stays faster than the
> 'MERGE' approach over time. However, it does not seem elegant to have to
> extract the existing nodes before each batch update: it requires a read plus
> a write against the Neo4j database, instead of only a write.
>
> We are wondering whether there is an elegant solution for large-scale data
> updates in Neo4j. We feel this may be a common question for many users, and
> someone may have encountered it before and/or developed a robust solution.
>