You're right, the Lucene-based import shouldn't fail because of memory problems, I will look into that.

My suggestion is still valid if you want to use an in-memory map to speed up the import. And if you're able to analyze / partition your data, that might be a viable solution.
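To make the partitioning idea concrete, here is an untested sketch (the ShardedNodeCache class and its file layout below are purely illustrative, not an existing Neo4j API): split the name -> node-id map into shards by the first character of the name, keep only one shard's HashMap on the heap, and spill the others to small files on disk. This only pays off if you can sort or group the input so that consecutive lookups mostly stay within one shard; otherwise the constant spill/reload would dominate.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a name -> node-id cache sharded by the first
// character of the name; one shard is resident, the rest live on disk.
public class ShardedNodeCache {
    private final File dir;
    private char currentShard = 0;
    private Map<String, Long> current = new HashMap<String, Long>();

    public ShardedNodeCache(File dir) {
        this.dir = dir;
        dir.mkdirs();
    }

    public void put(String name, long nodeId) throws IOException {
        switchTo(name.charAt(0));
        current.put(name, nodeId);
    }

    public Long get(String name) throws IOException {
        switchTo(name.charAt(0));
        return current.get(name);
    }

    // call once at the end so the last shard is written out as well
    public void close() throws IOException {
        if (currentShard != 0) save(currentShard, current);
    }

    private void switchTo(char shard) throws IOException {
        if (shard == currentShard) return;
        if (currentShard != 0) save(currentShard, current); // spill the old shard
        current = load(shard);                              // bring in the new one
        currentShard = shard;
    }

    private File file(char shard) {
        return new File(dir, "shard-" + (int) shard + ".bin");
    }

    private void save(char shard, Map<String, Long> map) throws IOException {
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file(shard))));
        out.writeInt(map.size());
        for (Map.Entry<String, Long> e : map.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeLong(e.getValue());
        }
        out.close();
    }

    private Map<String, Long> load(char shard) throws IOException {
        Map<String, Long> map = new HashMap<String, Long>();
        File f = file(shard);
        if (!f.exists()) return map;
        DataInputStream in = new DataInputStream(new BufferedInputStream(new FileInputStream(f)));
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            map.put(in.readUTF(), in.readLong());
        }
        in.close();
        return map;
    }
}

In the import below you would then call cache.put(name, node) in the first pass and cache.get(...) in the second, at the spots where the plain HashMap is used today.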
Will get back to you with the findings later.

Michael

On 10 Jun 2011, at 09:02, Paul Bandler wrote:

> On 9 Jun 2011, at 22:12, Michael Hunger wrote:
>
>> Please keep in mind that the HashMap of 10M strings -> longs will take a substantial amount of heap memory. That's not the fault of Neo4j :) On my system it alone takes 1.8 G of memory (distributed across the strings, the hashmap entries and the longs).
>
> Fair enough, but removing the Map and using the Index instead, and setting the cache_type to weak, makes almost no difference to the program's behaviour in terms of progressively consuming the heap until it fails. I did this, including removal of the allocation of the Map, and watched the heap consumption follow a similar pattern until it failed, as below.
>
>> Or you should perhaps use an amazon ec2 instance which you can easily get with up to 68 G of RAM :)
>
> With respect, and while I notice the smile, throwing memory at it is not an option for a large set of enterprise applications that might actually be willing to pay to use Neo4j if it didn't fail at the first hurdle when confronted with a trivial and small-scale data load...
>
> runImport failed after 2,072 seconds....
>
> Creating data took 316 seconds
> Physical mem: 1535MB, Heap size: 1016MB
> use_memory_mapped_buffers=false
> neostore.propertystore.db.index.keys.mapped_memory=1M
> neostore.propertystore.db.strings.mapped_memory=52M
> neostore.propertystore.db.arrays.mapped_memory=60M
> neo_store=N:\TradeModel\target\hepper\neostore
> neostore.relationshipstore.db.mapped_memory=76M
> neostore.propertystore.db.index.mapped_memory=1M
> neostore.propertystore.db.mapped_memory=62M
> dump_configuration=true
> cache_type=weak
> neostore.nodestore.db.mapped_memory=17M
> 1000000 nodes created. Took 59906
> 2000000 nodes created. Took 64546
> 3000000 nodes created. Took 74577
> 4000000 nodes created. Took 82607
> 5000000 nodes created. Took 171091
>
> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedOutputStream.<init>(Unknown Source)
>         at java.io.BufferedOutputStream.<init>(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedInputStream.<init>(Unknown Source)
>         at java.io.BufferedInputStream.<init>(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedOutputStream.<init>(Unknown Source)
>         at java.io.BufferedOutputStream.<init>(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedInputStream.<init>(Unknown Source)
>         at java.io.BufferedInputStream.<init>(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>         at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
>
>> So 3 GB of heap are sensible to run this, which leaves about 1 G for Neo4j and its caches.
>>
>> Of course you're free to shard your map (e.g. by first letter of the name) and persist those maps to disk and reload them if needed. But that's an application-level concern. If you are really limited that way wrt memory you should try Chris Gioran's implementation, which will take care of that. Or you should perhaps use an amazon ec2 instance which you can easily get with up to 68 G of RAM :)
>>
>> Cheers
>>
>> Michael
>>
>> P.S. As a side-note, for the rest of the memory:
>> Have you tried to use the weak reference cache instead of the default soft one? In your config.properties add
>>
>> cache_type = weak
>>
>> That should take care of your memory problems (and the stopping, which is actually the GC trying to reclaim memory).
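>>
>> If you'd rather set it in code than in config.properties, roughly like this (a minimal sketch, assuming the 1.x embedded API that accepts a config map; the store path is just an example):
>>
>> Map<String, String> config = MapUtil.stringMap("cache_type", "weak");
>> GraphDatabaseService graphDb = new EmbeddedGraphDatabase("path/to/your/db", config);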
>>
>> On 9 Jun 2011, at 22:36, Paul Bandler wrote:
>>
>>> I ran Michael's example test import program, with the Map replacing the index, on my more modestly configured machine to see whether the import scaling problems I have reported previously using BatchInserter were reproduced. They were – I gave the program 1G of heap and watched it run using jconsole. It ran reasonably quickly, consuming the heap in an almost straight line, until it neared capacity; then it practically stopped for about 20 minutes, after which it died with an out-of-memory error – see below.
>>>
>>> Now I'm not saying that Neo4j should necessarily go out of its way to support very memory-constrained environments, but I do think it is not unreasonable to expect its batch import mechanism not to fall over in this way; it should rather flush its buffers or whatever, without requiring the import application writer to shut it down and restart it periodically...
>>>
>>> Creating data took 331 seconds
>>> 1000000 nodes created. Took 29001
>>> 2000000 nodes created. Took 35107
>>> 3000000 nodes created. Took 35904
>>> 4000000 nodes created. Took 66169
>>> 5000000 nodes created. Took 63280
>>> 6000000 nodes created. Took 183922
>>> 7000000 nodes created. Took 258276
>>>
>>> com.nomura.smo.rdm.neo4j.restore.Hepper
>>> createData(330.364seconds)
>>> runImport (1,485 seconds later...)
>>> java.lang.OutOfMemoryError: Java heap space
>>>         at java.util.ArrayList.<init>(Unknown Source)
>>>         at java.util.ArrayList.<init>(Unknown Source)
>>>         at org.neo4j.kernel.impl.nioneo.store.PropertyRecord.<init>(PropertyRecord.java:33)
>>>         at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createPropertyChain(BatchInserterImpl.java:425)
>>>         at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createNode(BatchInserterImpl.java:143)
>>>         at com.nomura.smo.rdm.neo4j.restore.Hepper.runImport(Hepper.java:61)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>         at java.lang.reflect.Method.invoke(Unknown Source)
>>>         at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>>>         at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>>>         at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>>>         at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>>>         at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
>>>         at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
>>>         at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
>>>         at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>>>         at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>>>         at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>>>         at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>>>         at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>>>         at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>>>         at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
>>>         at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>         at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>>>         at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>>>         at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>>>         at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
>>>
>>> Regards,
>>> Paul Bandler
>>>
>>> On 9 Jun 2011, at 12:27, Michael Hunger wrote:
>>>
>>>> I recreated Daniel's code in Java, mainly because some things were missing from his Scala example.
>>>>
>>>> You're right that the index is the bottleneck. But with your small data set it should be possible to cache the 10M nodes in a heap that fits in your machine.
>>>>
>>>> I ran it first with the index and had about 8 seconds / 1M nodes and 320 sec / 1M rels.
>>>>
>>>> Then I switched to a 3G heap and a HashMap for the name => node lookup, and it went to 2 sec / 1M nodes and from 13 down to 3 sec for 1M rels.
>>>>
>>>> That is the approach that Chris takes, only his solution can persist the map to disk and is more efficient :)
>>>>
>>>> Hope that helps.
>>>>
>>>> Michael
>>>>
>>>> package org.neo4j.load;
>>>>
>>>> import org.apache.commons.io.FileUtils;
>>>> import org.junit.Test;
>>>> import org.neo4j.graphdb.RelationshipType;
>>>> import org.neo4j.graphdb.index.BatchInserterIndex;
>>>> import org.neo4j.graphdb.index.BatchInserterIndexProvider;
>>>> import org.neo4j.helpers.collection.MapUtil;
>>>> import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserter;
>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
>>>>
>>>> import java.io.*;
>>>> import java.util.HashMap;
>>>> import java.util.Map;
>>>> import java.util.Random;
>>>>
>>>> /**
>>>>  * @author mh
>>>>  * @since 09.06.11
>>>>  */
>>>> public class Hepper {
>>>>
>>>>     public static final int REPORT_COUNT = Config.MILLION;
>>>>
>>>>     enum MyRelationshipTypes implements RelationshipType {
>>>>         BELONGS_TO
>>>>     }
>>>>
>>>>     public static final int COUNT = Config.MILLION * 10;
>>>>
>>>>     @Test
>>>>     public void createData() throws IOException {
>>>>         long time = System.currentTimeMillis();
>>>>         final PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter("data.txt")));
>>>>         Random r = new Random(-1L);
>>>>         for (int nodes = 0; nodes < COUNT; nodes++) {
>>>>             writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT), r.nextInt(COUNT));
>>>>         }
>>>>         writer.close();
>>>>         System.out.println("Creating data took " + (System.currentTimeMillis() - time) / 1000 + " seconds");
>>>>     }
>>>>
>>>>     @Test
>>>>     public void runImport() throws IOException {
>>>>         Map<String, Long> cache = new HashMap<String, Long>(COUNT);
>>>>         final File storeDir = new File("target/hepper");
>>>>         FileUtils.deleteDirectory(storeDir);
>>>>         BatchInserter inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
>>>>         final BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter);
>>>>         final BatchInserterIndex index = indexProvider.nodeIndex("pages", MapUtil.stringMap("type", "exact"));
>>>>
>>>>         // first pass: create one node per line and remember its id in the in-memory cache
>>>>         BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
>>>>         String line = null;
>>>>         int nodes = 0;
>>>>         long time = System.currentTimeMillis();
>>>>         long batchTime = time;
>>>>         while ((line = reader.readLine()) != null) {
>>>>             final String[] nodeNames = line.split("\\|");
>>>>             final String name = nodeNames[0];
>>>>             final Map<String, Object> props = MapUtil.map("name", name);
>>>>             final long node = inserter.createNode(props);
>>>>             //index.add(node, props);
>>>>             cache.put(name, node);
>>>>             nodes++;
>>>>             if ((nodes % REPORT_COUNT) == 0) {
>>>>                 System.out.printf("%d nodes created. Took %d %n", nodes, (System.currentTimeMillis() - batchTime));
>>>>                 batchTime = System.currentTimeMillis();
>>>>             }
>>>>         }
>>>>         System.out.println("Creating nodes took " + (System.currentTimeMillis() - time) / 1000);
>>>>         index.flush();
>>>>         reader.close();
>>>>
>>>>         // second pass: look the node ids up again and create the relationships
>>>>         reader = new BufferedReader(new FileReader("data.txt"));
>>>>         int rels = 0;
>>>>         time = System.currentTimeMillis();
>>>>         batchTime = time;
>>>>         while ((line = reader.readLine()) != null) {
>>>>             final String[] nodeNames = line.split("\\|");
>>>>             final String name = nodeNames[0];
>>>>             //final Long from = index.get("name", name).getSingle();
>>>>             Long from = cache.get(name);
>>>>             for (int j = 1; j < nodeNames.length; j++) {
>>>>                 //final Long to = index.get("name", nodeNames[j]).getSingle();
>>>>                 final Long to = cache.get(nodeNames[j]);
>>>>                 inserter.createRelationship(from, to, MyRelationshipTypes.BELONGS_TO, null);
>>>>             }
>>>>             rels++;
>>>>             if ((rels % REPORT_COUNT) == 0) {
>>>>                 System.out.printf("%d relationships created. Took %d %n", rels, (System.currentTimeMillis() - batchTime));
>>>>                 batchTime = System.currentTimeMillis();
>>>>             }
>>>>         }
>>>>         System.out.println("Creating relationships took " + (System.currentTimeMillis() - time) / 1000);
>>>>         reader.close();
>>>>
>>>>         // shut down so the index and the store are flushed to disk
>>>>         indexProvider.shutdown();
>>>>         inserter.shutdown();
>>>>     }
>>>> }
>>>>
>>>> 1000000 nodes created. Took 2227
>>>> 2000000 nodes created. Took 1930
>>>> 3000000 nodes created. Took 1818
>>>> 4000000 nodes created. Took 1966
>>>> 5000000 nodes created. Took 1857
>>>> 6000000 nodes created. Took 2009
>>>> 7000000 nodes created. Took 2068
>>>> 8000000 nodes created. Took 1991
>>>> 9000000 nodes created. Took 2151
>>>> 10000000 nodes created. Took 2276
>>>> Creating nodes took 20
>>>> 1000000 relationships created. Took 13441
>>>> 2000000 relationships created. Took 12887
>>>> 3000000 relationships created. Took 12922
>>>> 4000000 relationships created. Took 13149
>>>> 5000000 relationships created. Took 14177
>>>> 6000000 relationships created. Took 3377
>>>> 7000000 relationships created. Took 2932
>>>> 8000000 relationships created. Took 2991
>>>> 9000000 relationships created. Took 2992
>>>> 10000000 relationships created. Took 2912
>>>> Creating relationships took 81
>>>>
>>>> On 9 Jun 2011, at 12:51, Chris Gioran wrote:
>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> I am currently working on a tool for importing big data sets into Neo4j graphs. The main problem in such operations is that the usual index implementations are just too slow for retrieving the mapping from keys to created node ids, so a custom solution is needed, one that depends to a varying degree on the distribution of values in the input set.
>>>>>
>>>>> While your dataset is smaller than the data sizes I deal with, I would like to use it as a test case. If you could somehow provide the actual data, or something that emulates it, I would be grateful.
>>>>>
>>>>> If you want to see my approach, it is available here:
>>>>>
>>>>> https://github.com/digitalstain/BigDataImport
>>>>>
>>>>> The core algorithm is an XJoin-style two-level hashing scheme with adaptable eviction strategies, but it is not production ready yet, mainly from an API perspective.
>>>>>
>>>>> You can contact me directly for any details regarding this issue.
>>>>>
>>>>> cheers,
>>>>> CG
>>>>>
>>>>> On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper <daniel.hep...@gmail.com> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I'm struggling with importing a graph with about 10m nodes and 20m relationships, with nodes having 0 to 10 relationships.
>>>>>> Creating the nodes takes about 10 minutes, but creating the relationships is slower by several orders of magnitude. I'm using a 2.4 GHz i7 MacBook Pro with 4GB RAM and a conventional HDD.
>>>>>>
>>>>>> The graph is stored as an adjacency list in a text file where each line has this form:
>>>>>>
>>>>>> Foo|Bar|Baz
>>>>>> (Node Foo has relations to Bar and Baz)
>>>>>>
>>>>>> My current approach is to iterate over the whole file twice. In the first run, I create a node with the property "name" for the first entry in the line (Foo in this case) and add it to an index. In the second run, I get the start node and the end nodes from the index by name and create the relationships.
>>>>>>
>>>>>> My code can be found here: http://pastie.org/2041801
>>>>>>
>>>>>> With my approach, the best I can achieve is 100 created relationships per second. I experimented with mapped memory settings, but without much effect. Is this the speed I can expect? Any advice on how to speed up this process?
>>>>>>
>>>>>> Best regards,
>>>>>> Daniel Hepper

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user