You're right, the Lucene-based import shouldn't fail because of memory problems, I will look into that.

My suggestion is still valid if you want to use an in-memory map to speed up the import. And if you're able to analyze / partition your data, that might be a viable solution.
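To make the partitioning idea concrete, here is an untested sketch (the ShardedNodeCache class and its file layout below are purely illustrative, not an existing Neo4j API): split the name -> node-id map into shards by the first character of the name, keep only one shard's HashMap on the heap, and spill the others to small files on disk. This only pays off if you can sort or group the input so that consecutive lookups mostly stay within one shard; otherwise the constant spill/reload would dominate.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a name -> node-id cache sharded by the first
// character of the name; one shard is resident, the rest live on disk.
public class ShardedNodeCache {
    private final File dir;
    private char currentShard = 0;
    private Map<String, Long> current = new HashMap<String, Long>();

    public ShardedNodeCache(File dir) {
        this.dir = dir;
        dir.mkdirs();
    }

    public void put(String name, long nodeId) throws IOException {
        switchTo(name.charAt(0));
        current.put(name, nodeId);
    }

    public Long get(String name) throws IOException {
        switchTo(name.charAt(0));
        return current.get(name);
    }

    // call once at the end so the last shard is written out as well
    public void close() throws IOException {
        if (currentShard != 0) save(currentShard, current);
    }

    private void switchTo(char shard) throws IOException {
        if (shard == currentShard) return;
        if (currentShard != 0) save(currentShard, current); // spill the old shard
        current = load(shard);                              // bring in the new one
        currentShard = shard;
    }

    private File file(char shard) {
        return new File(dir, "shard-" + (int) shard + ".bin");
    }

    private void save(char shard, Map<String, Long> map) throws IOException {
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file(shard))));
        out.writeInt(map.size());
        for (Map.Entry<String, Long> e : map.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeLong(e.getValue());
        }
        out.close();
    }

    private Map<String, Long> load(char shard) throws IOException {
        Map<String, Long> map = new HashMap<String, Long>();
        File f = file(shard);
        if (!f.exists()) return map;
        DataInputStream in = new DataInputStream(new BufferedInputStream(new FileInputStream(f)));
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            map.put(in.readUTF(), in.readLong());
        }
        in.close();
        return map;
    }
}

In the import below you would then call cache.put(name, node) in the first pass and cache.get(...) in the second, at the spots where the plain HashMap is used today.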
Will get back to you with the findings later.

Michael

On 10 Jun 2011, at 09:02, Paul Bandler wrote:

> On 9 Jun 2011, at 22:12, Michael Hunger wrote:
>
>> Please keep in mind that the HashMap of 10M strings -> longs will take a substantial amount of heap memory. That's not the fault of Neo4j :) On my system it alone takes 1.8 G of memory (distributed across the strings, the hashmap entries and the longs).
>
> Fair enough, but removing the Map and using the Index instead, and setting the cache_type to weak, makes almost no difference to the program's behaviour in terms of progressively consuming the heap until it fails. I did this, including removal of the allocation of the Map, and watched the heap consumption follow a similar pattern until it failed, as below.
>
>> Or you should perhaps use an amazon ec2 instance which you can easily get with up to 68 G of RAM :)
>
> With respect, and while I notice the smile, throwing memory at it is not an option for a large set of enterprise applications that might actually be willing to pay to use Neo4j if it didn't fail at the first hurdle when confronted with a trivial and small-scale data load...
>
> runImport failed after 2,072 seconds....
>
> Creating data took 316 seconds
> Physical mem: 1535MB, Heap size: 1016MB
> use_memory_mapped_buffers=false
> neostore.propertystore.db.index.keys.mapped_memory=1M
> neostore.propertystore.db.strings.mapped_memory=52M
> neostore.propertystore.db.arrays.mapped_memory=60M
> neo_store=N:\TradeModel\target\hepper\neostore
> neostore.relationshipstore.db.mapped_memory=76M
> neostore.propertystore.db.index.mapped_memory=1M
> neostore.propertystore.db.mapped_memory=62M
> dump_configuration=true
> cache_type=weak
> neostore.nodestore.db.mapped_memory=17M
> 1000000 nodes created. Took 59906
> 2000000 nodes created. Took 64546
> 3000000 nodes created. Took 74577
> 4000000 nodes created. Took 82607
> 5000000 nodes created. Took 171091
>
> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedOutputStream.<init>(Unknown Source)
>         at java.io.BufferedOutputStream.<init>(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedInputStream.<init>(Unknown Source)
>         at java.io.BufferedInputStream.<init>(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedOutputStream.<init>(Unknown Source)
>         at java.io.BufferedOutputStream.<init>(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedInputStream.<init>(Unknown Source)
>         at java.io.BufferedInputStream.<init>(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
>
>> So 3 GB of heap are sensible to run this, which leaves about 1 G for Neo4j and its caches.
>>
>> Of course you're free to shard your map (e.g. by first letter of the name) and persist those maps to disk and reload them if needed. But that's an application-level concern. If you are really limited that way wrt memory you should try Chris Gioran's implementation, which will take care of that. Or you should perhaps use an amazon ec2 instance which you can easily get with up to 68 G of RAM :)
>>
>> Cheers
>>
>> Michael
>>
>> P.S. As a side-note, for the rest of the memory:
>> Have you tried to use the weak reference cache instead of the default soft one? In your config.properties add
>>
>> cache_type = weak
>>
>> That should take care of your memory problems (and the stopping, which is actually the GC trying to reclaim memory).
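>>
>> If you'd rather set it in code than in config.properties, roughly like this (a minimal sketch, assuming the 1.x embedded API that accepts a config map; the store path is just an example):
>>
>> Map<String, String> config = MapUtil.stringMap("cache_type", "weak");
>> GraphDatabaseService graphDb = new EmbeddedGraphDatabase("path/to/your/db", config);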
>>
>> On 9 Jun 2011, at 22:36, Paul Bandler wrote:
>>
>>> I ran Michael's example test import program, with the Map replacing the index, on my more modestly configured machine to see whether the import scaling problems I have reported previously using BatchInserter were reproduced. They were – I gave the program 1G of heap and watched it run using jconsole. It ran reasonably quickly, consuming the heap in an almost straight line, until it neared capacity; then it practically stopped for about 20 minutes, after which it died with an out-of-memory error – see below.
>>>
>>> Now I'm not saying that Neo4j should necessarily go out of its way to support very memory-constrained environments, but I do think it is not unreasonable to expect its batch import mechanism not to fall over in this way; it should rather flush its buffers or whatever, without requiring the import application writer to shut it down and restart it periodically...
>>>
>>> Creating data took 331 seconds
>>> 1000000 nodes created. Took 29001
>>> 2000000 nodes created. Took 35107
>>> 3000000 nodes created. Took 35904
>>> 4000000 nodes created. Took 66169
>>> 5000000 nodes created. Took 63280
>>> 6000000 nodes created. Took 183922
>>> 7000000 nodes created. Took 258276
>>>
>>> com.nomura.smo.rdm.neo4j.restore.Hepper
>>> createData(330.364seconds)
>>> runImport (1,485 seconds later...)
>>> java.lang.OutOfMemoryError: Java heap space
>>>         at java.util.ArrayList.<init>(Unknown Source)
>>>         at java.util.ArrayList.<init>(Unknown Source)
>>>         at org.neo4j.kernel.impl.nioneo.store.PropertyRecord.<init>(PropertyRecord.java:33)
>>>         at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createPropertyChain(BatchInserterImpl.java:425)
>>>         at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createNode(BatchInserterImpl.java:143)
>>>         at com.nomura.smo.rdm.neo4j.restore.Hepper.runImport(Hepper.java:61)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>         at java.lang.reflect.Method.invoke(Unknown Source)
>>>         at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>>>         at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>>>         at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>>>         at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>>>         at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
>>>         at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
>>>         at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
>>>         at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>>>         at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>>>         at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>>>         at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>>>         at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>>>         at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>>>         at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
>>>         at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>         at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>>>         at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>>>         at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>>>         at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
>>>
>>> Regards,
>>> Paul Bandler
>>>
>>> On 9 Jun 2011, at 12:27, Michael Hunger wrote:
>>>
>>>> I recreated Daniel's code in Java, mainly because some things were missing from his Scala example.
>>>>
>>>> You're right that the index is the bottleneck. But with your small data set it should be possible to cache the 10M nodes in a heap that fits in your machine.
>>>>
>>>> I ran it first with the index and had about 8 seconds / 1M nodes and 320 sec / 1M rels.
>>>>
>>>> Then I switched to a 3G heap and a HashMap for the name => node lookup, and it went to 2 sec / 1M nodes and from 13 down to 3 sec for 1M rels.
>>>>
>>>> That is the approach that Chris takes, only his solution can persist the map to disk and is more efficient :)
>>>>
>>>> Hope that helps.
>>>>
>>>> Michael
>>>>
>>>> package org.neo4j.load;
>>>>
>>>> import org.apache.commons.io.FileUtils;
>>>> import org.junit.Test;
>>>> import org.neo4j.graphdb.RelationshipType;
>>>> import org.neo4j.graphdb.index.BatchInserterIndex;
>>>> import org.neo4j.graphdb.index.BatchInserterIndexProvider;
>>>> import org.neo4j.helpers.collection.MapUtil;
>>>> import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserter;
>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
>>>>
>>>> import java.io.*;
>>>> import java.util.HashMap;
>>>> import java.util.Map;
>>>> import java.util.Random;
>>>>
>>>> /**
>>>>  * @author mh
>>>>  * @since 09.06.11
>>>>  */
>>>> public class Hepper {
>>>>
>>>>     public static final int REPORT_COUNT = Config.MILLION;
>>>>
>>>>     enum MyRelationshipTypes implements RelationshipType {
>>>>         BELONGS_TO
>>>>     }
>>>>
>>>>     public static final int COUNT = Config.MILLION * 10;
>>>>
>>>>     @Test
>>>>     public void createData() throws IOException {
>>>>         long time = System.currentTimeMillis();
>>>>         final PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter("data.txt")));
>>>>         Random r = new Random(-1L);
>>>>         for (int nodes = 0; nodes < COUNT; nodes++) {
>>>>             writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT), r.nextInt(COUNT));
>>>>         }
>>>>         writer.close();
>>>>         System.out.println("Creating data took " + (System.currentTimeMillis() - time) / 1000 + " seconds");
>>>>     }
>>>>
>>>>     @Test
>>>>     public void runImport() throws IOException {
>>>>         Map<String, Long> cache = new HashMap<String, Long>(COUNT);
>>>>         final File storeDir = new File("target/hepper");
>>>>         FileUtils.deleteDirectory(storeDir);
>>>>         BatchInserter inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
>>>>         final BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter);
>>>>         final BatchInserterIndex index = indexProvider.nodeIndex("pages", MapUtil.stringMap("type", "exact"));
>>>>
>>>>         // first pass: create one node per line and remember its id in the in-memory cache
>>>>         BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
>>>>         String line = null;
>>>>         int nodes = 0;
>>>>         long time = System.currentTimeMillis();
>>>>         long batchTime = time;
>>>>         while ((line = reader.readLine()) != null) {
>>>>             final String[] nodeNames = line.split("\\|");
>>>>             final String name = nodeNames[0];
>>>>             final Map<String, Object> props = MapUtil.map("name", name);
>>>>             final long node = inserter.createNode(props);
>>>>             //index.add(node, props);
>>>>             cache.put(name, node);
>>>>             nodes++;
>>>>             if ((nodes % REPORT_COUNT) == 0) {
>>>>                 System.out.printf("%d nodes created. Took %d %n", nodes, (System.currentTimeMillis() - batchTime));
>>>>                 batchTime = System.currentTimeMillis();
>>>>             }
>>>>         }
>>>>         System.out.println("Creating nodes took " + (System.currentTimeMillis() - time) / 1000);
>>>>         index.flush();
>>>>         reader.close();
>>>>
>>>>         // second pass: look the node ids up again and create the relationships
>>>>         reader = new BufferedReader(new FileReader("data.txt"));
>>>>         int rels = 0;
>>>>         time = System.currentTimeMillis();
>>>>         batchTime = time;
>>>>         while ((line = reader.readLine()) != null) {
>>>>             final String[] nodeNames = line.split("\\|");
>>>>             final String name = nodeNames[0];
>>>>             //final Long from = index.get("name", name).getSingle();
>>>>             Long from = cache.get(name);
>>>>             for (int j = 1; j < nodeNames.length; j++) {
>>>>                 //final Long to = index.get("name", nodeNames[j]).getSingle();
>>>>                 final Long to = cache.get(nodeNames[j]);
>>>>                 inserter.createRelationship(from, to, MyRelationshipTypes.BELONGS_TO, null);
>>>>             }
>>>>             rels++;
>>>>             if ((rels % REPORT_COUNT) == 0) {
>>>>                 System.out.printf("%d relationships created. Took %d %n", rels, (System.currentTimeMillis() - batchTime));
>>>>                 batchTime = System.currentTimeMillis();
>>>>             }
>>>>         }
>>>>         System.out.println("Creating relationships took " + (System.currentTimeMillis() - time) / 1000);
>>>>         reader.close();
>>>>
>>>>         // shut down so the index and the store are flushed to disk
>>>>         indexProvider.shutdown();
>>>>         inserter.shutdown();
>>>>     }
>>>> }
>>>>
>>>> 1000000 nodes created. Took 2227
>>>> 2000000 nodes created. Took 1930
>>>> 3000000 nodes created. Took 1818
>>>> 4000000 nodes created. Took 1966
>>>> 5000000 nodes created. Took 1857
>>>> 6000000 nodes created. Took 2009
>>>> 7000000 nodes created. Took 2068
>>>> 8000000 nodes created. Took 1991
>>>> 9000000 nodes created. Took 2151
>>>> 10000000 nodes created. Took 2276
>>>> Creating nodes took 20
>>>> 1000000 relationships created. Took 13441
>>>> 2000000 relationships created. Took 12887
>>>> 3000000 relationships created. Took 12922
>>>> 4000000 relationships created. Took 13149
>>>> 5000000 relationships created. Took 14177
>>>> 6000000 relationships created. Took 3377
>>>> 7000000 relationships created. Took 2932
>>>> 8000000 relationships created. Took 2991
>>>> 9000000 relationships created. Took 2992
>>>> 10000000 relationships created. Took 2912
>>>> Creating relationships took 81
>>>>
>>>> On 9 Jun 2011, at 12:51, Chris Gioran wrote:
>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> I am currently working on a tool for importing big data sets into Neo4j graphs. The main problem in such operations is that the usual index implementations are just too slow for retrieving the mapping from keys to created node ids, so a custom solution is needed, one that depends to a varying degree on the distribution of values in the input set.
>>>>>
>>>>> While your dataset is smaller than the data sizes I deal with, I would like to use it as a test case. If you could somehow provide the actual data, or something that emulates it, I would be grateful.
>>>>>
>>>>> If you want to see my approach, it is available here:
>>>>>
>>>>> https://github.com/digitalstain/BigDataImport
>>>>>
>>>>> The core algorithm is an XJoin-style two-level hashing scheme with adaptable eviction strategies, but it is not production ready yet, mainly from an API perspective.
>>>>>
>>>>> You can contact me directly for any details regarding this issue.
>>>>>
>>>>> cheers,
>>>>> CG
>>>>>
>>>>> On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper <daniel.hep...@gmail.com> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I'm struggling with importing a graph with about 10m nodes and 20m relationships, with nodes having 0 to 10 relationships.
>>>>>> Creating the nodes takes about 10 minutes, but creating the relationships is slower by several orders of magnitude. I'm using a 2.4 GHz i7 MacBook Pro with 4GB RAM and a conventional HDD.
>>>>>>
>>>>>> The graph is stored as an adjacency list in a text file where each line has this form:
>>>>>>
>>>>>> Foo|Bar|Baz
>>>>>> (Node Foo has relations to Bar and Baz)
>>>>>>
>>>>>> My current approach is to iterate over the whole file twice. In the first run, I create a node with the property "name" for the first entry in the line (Foo in this case) and add it to an index. In the second run, I get the start node and the end nodes from the index by name and create the relationships.
>>>>>>
>>>>>> My code can be found here: http://pastie.org/2041801
>>>>>>
>>>>>> With my approach, the best I can achieve is 100 created relationships per second. I experimented with mapped memory settings, but without much effect. Is this the speed I can expect? Any advice on how to speed up this process?
>>>>>>
>>>>>> Best regards,
>>>>>> Daniel Hepper

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user