> can you share your test and the CompactIndex you wrote?
>
> That would be great.
See below...

> Also the memory settings (Xmx) you used for the different runs.

The heap size is displayed by neo4j, is it not, with console entries such as:

>> Physical mem: 1535MB, Heap size: 1016MB

So that one came from -Xmx1024M, and

>> Physical mem: 4096MB, Heap size: 2039MB

came from -Xmx2048M.

regards,
Paul

package com.xxx.neo4j.restore;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class NodeIdPair implements Comparable<NodeIdPair> {

    private long _node;
    private int _id;

    static final NodeIdPair _prototype = new NodeIdPair(Long.MAX_VALUE, Integer.MAX_VALUE);
    static Integer MY_SIZE = null;

    public static int size() {
        if (MY_SIZE == null) {
            MY_SIZE = (new NodeIdPair(Long.MAX_VALUE, Integer.MAX_VALUE)).toByteArray().length;
            System.out.println("MY_SIZE: " + MY_SIZE);
        }
        return MY_SIZE;
    }

    public NodeIdPair(long node, int id) {
        _node = node;
        _id = id;
    }

    public NodeIdPair(byte fromByteArray[]) {
        ByteArrayInputStream bais = new ByteArrayInputStream(fromByteArray);
        DataInputStream dis = new DataInputStream(bais);
        try {
            _node = dis.readLong();
            _id = dis.readInt();
        } catch (Exception e) {
            throw new Error("Unexpected exception. byte[] len " + fromByteArray.length, e);
        }
    }

    byte[] toByteArray() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream(MY_SIZE != null ? MY_SIZE : 12);
        DataOutputStream dos = new DataOutputStream(bos);
        try {
            dos.writeLong(_node);
            dos.writeInt(_id);
            dos.flush();
        } catch (Exception e) {
            throw new Error("Unexpected exception: ", e);
        }
        return bos.toByteArray();
    }

    @Override
    public int compareTo(NodeIdPair arg0) {
        return _id - arg0._id;
    }

    public long getNode() {
        return _node;
    }

    public int getId() {
        return _id;
    }
}

package com.xxx.neo4j.restore;

import java.util.Arrays;
import java.util.TreeSet;

public class CompactNodeIndex {

    private int _offSet = 0;
    private byte _extent[];
    private int _slotCount;

    public CompactNodeIndex(TreeSet<NodeIdPair> sortedPairs) {
        _extent = new byte[sortedPairs.size() * NodeIdPair.size()];
        _slotCount = sortedPairs.size();
        for (NodeIdPair pair : sortedPairs) {
            byte pairBytes[] = pair.toByteArray();
            copyToExtent(pairBytes);
        }
        System.out.println("CompactNodeIndex slot count: " + _slotCount);
    }

    public NodeIdPair findNodeForId(int id) {
        return search(id, 0, _slotCount - 1);
    }

    @SuppressWarnings("serial")
    static class FoundIt extends Exception {
        NodeIdPair _result;

        FoundIt(NodeIdPair result) {
            _result = result;
        }
    }

    private NodeIdPair search(int soughtId, int lowerBound, int upperBound) {
        try {
            while (true) {
                if ((upperBound - lowerBound) > 1) {
                    int compareSlot = lowerBound + ((upperBound - lowerBound) / 2);
                    int comparison = compareAt(soughtId, compareSlot);
                    if (comparison > 0) {
                        lowerBound = compareSlot;
                        continue;
                    } else {
                        upperBound = compareSlot;
                    }
                } else {
                    compareAt(soughtId, upperBound);
                    compareAt(soughtId, lowerBound);
                    // not found
                    return null;
                }
            }
        } catch (FoundIt result) {
            return result._result;
        }
    }

    private int compareAt(int soughtId, int compareSlot) throws FoundIt {
        NodeIdPair candidate = get(compareSlot);
        int diff = soughtId - candidate.getId();
        if (diff == 0)
            throw new FoundIt(candidate);
        return diff;
    }

    private NodeIdPair get(int compareSlot) {
        int startPos = compareSlot * NodeIdPair.size();
        byte serialisedPair[] = Arrays.copyOfRange(_extent, startPos, startPos + NodeIdPair.size());
        return new NodeIdPair(serialisedPair);
    }

    private void copyToExtent(byte[] pairBytes) {
        for (byte b : pairBytes) {
            if (_offSet >= _extent.length) // was '>', which would have thrown ArrayIndexOutOfBoundsException before the Error
                throw new Error("Unexpected extent overflow: " + _offSet);
            _extent[_offSet++] = b;
        }
    }
}

On 13 Jun 2011, at 13:23, Michael Hunger wrote:

> Paul,
>
> can you share your test and the CompactIndex you wrote?
>
> That would be great.
>
> Also the memory settings (Xmx) you used for the different runs.
>
> Thanks so much
>
> Michael
>
> Am 13.06.2011 um 14:15 schrieb Paul Bandler:
>
>> Having noticed a mention in the 1.4M04 release notes that:
>>
>>> Also, the BatchInserterIndex now keeps its memory usage in-check with
>>> batched commits of indexed data using a configurable batch commit size.
>>
>> I re-ran this test using M04 and sure enough, node creation no longer eats
>> up the heap linearly, so that is good - I should be able to remove the
>> periodic resetting of the BatchInserter during import.
>>
>> So I returned to the issue of removing the index creation and later access
>> bottleneck using an application-managed data structure, as Michael
>> illustrated. Needing a solution with a smaller memory footprint, I wrote a
>> CompactNodeIndex class for mapping integer 'id' key values to long nodes
>> that uses a minimal memory footprint by overlaying a binary-choppable
>> table onto a byte array. Watching the heap in jconsole while this ran, I
>> could see it had the desired effect of releasing huge amounts of heap once
>> the CompactNodeIndex is loaded and the source data structure gc'd. However,
>> when I attempted to scale the test program back up to the 10M nodes Michael
>> had been testing, it ran into something of a brick wall, becoming massively
>> I/O bound when creating the relationships. With 1M nodes it ran ok, with 2M
>> nodes not too bad, but much beyond that it crawls along using just about 1%
>> of CPU while having loads of heap spare.
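[Editor's note: the same id -> node mapping that CompactNodeIndex packs into a byte array can be sketched more simply with two sorted parallel arrays and java.util.Arrays.binarySearch. This is an illustrative alternative, not Paul's class; the class name is invented. Memory cost is the same 12 bytes per entry, and lookups avoid deserialising a NodeIdPair object per probe.]

```java
import java.util.Arrays;

// Sketch: int-id -> long-node lookup over two parallel arrays sorted by id.
class ParallelArrayNodeIndex {
    private final int[] ids;     // ids, sorted ascending
    private final long[] nodes;  // nodes[i] is the node for ids[i]

    ParallelArrayNodeIndex(int[] sortedIds, long[] nodesInSameOrder) {
        this.ids = sortedIds;
        this.nodes = nodesInSameOrder;
    }

    /** Returns the node for the given id, or -1 if the id is absent. */
    long findNodeForId(int id) {
        int slot = Arrays.binarySearch(ids, id);
        // binarySearch returns a negative (-(insertionPoint) - 1) on a miss
        return slot >= 0 ? nodes[slot] : -1L;
    }
}
```

Using a sentinel return instead of the FoundIt exception also sidesteps the cost of filling in a stack trace on every successful lookup.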
>>
>> I re-ran on a more generously configured iMac (giving the test 4G of heap)
>> and it did much better in that it actually showed some progress building
>> relationships over a 10M node-set, but it still exhibited a massive
>> slowdown once past 7M relationships.
>>
>> Below are the test results - the question now is whether there are any
>> Neo4j parameters that might relieve this I/O bottleneck that appears when
>> building relationships over such-sized node-sets with the BatchInserter...?
>> I note the section in the manual on performance parameters, but not being
>> familiar enough with the Neo4j internals, I'm afraid I don't feel they give
>> enough clear information on how to set them to improve the performance of
>> this use-case.
>>
>> Thanks,
>>
>> Paul
>>
>> Run 1 - Windows m/c (REPORT_COUNT = MILLION/10):
>> Physical mem: 1535MB, Heap size: 1016MB
>> use_memory_mapped_buffers=false
>> neostore.propertystore.db.index.keys.mapped_memory=1M
>> neostore.propertystore.db.strings.mapped_memory=52M
>> neostore.propertystore.db.arrays.mapped_memory=60M
>> neo_store=N:\TradeModel\target\hepper\neostore
>> neostore.relationshipstore.db.mapped_memory=76M
>> neostore.propertystore.db.index.mapped_memory=1M
>> neostore.propertystore.db.mapped_memory=62M
>> dump_configuration=true
>> cache_type=weak
>> neostore.nodestore.db.mapped_memory=17M
>> 100000 nodes created. Took 2906
>> 200000 nodes created. Took 2688
>> 300000 nodes created. Took 2828
>> 400000 nodes created. Took 2953
>> 500000 nodes created. Took 2672
>> 600000 nodes created. Took 2766
>> 700000 nodes created. Took 2687
>> 800000 nodes created. Took 2703
>> 900000 nodes created. Took 2719
>> 1000000 nodes created. Took 2641
>> Creating nodes took 27
>> MY_SIZE: 12
>> CompactNodeIndex slot count: 1000000
>> 100000 relationships created. Took 4125
>> 200000 relationships created. Took 3953
>> 300000 relationships created. Took 3937
>> 400000 relationships created. Took 3610
>> 500000 relationships created. Took 3719
>> 600000 relationships created. Took 4328
>> 700000 relationships created. Took 3750
>> 800000 relationships created. Took 3609
>> 900000 relationships created. Took 4125
>> 1000000 relationships created. Took 3781
>> 1100000 relationships created. Took 4125
>> 1200000 relationships created. Took 3750
>> 1300000 relationships created. Took 3907
>> 1400000 relationships created. Took 4297
>> 1500000 relationships created. Took 3703
>> 1600000 relationships created. Took 3687
>> 1700000 relationships created. Took 4328
>> 1800000 relationships created. Took 3907
>> 1900000 relationships created. Took 3718
>> 2000000 relationships created. Took 3891
>> Creating relationships took 78
>>
>> 2M Nodes on Windows m/c:-
>>
>> Creating data took 68 seconds
>> Physical mem: 1535MB, Heap size: 1016MB
>> use_memory_mapped_buffers=false
>> neostore.propertystore.db.index.keys.mapped_memory=1M
>> neostore.propertystore.db.strings.mapped_memory=52M
>> neostore.propertystore.db.arrays.mapped_memory=60M
>> neo_store=N:\TradeModel\target\hepper\neostore
>> neostore.relationshipstore.db.mapped_memory=76M
>> neostore.propertystore.db.index.mapped_memory=1M
>> neostore.propertystore.db.mapped_memory=62M
>> dump_configuration=true
>> cache_type=weak
>> neostore.nodestore.db.mapped_memory=17M
>> 100000 nodes created. Took 3188
>> 200000 nodes created. Took 3094
>> 300000 nodes created. Took 3062
>> 400000 nodes created. Took 2813
>> 500000 nodes created. Took 2718
>> 600000 nodes created. Took 3000
>> 700000 nodes created. Took 2938
>> 800000 nodes created. Took 2828
>> 900000 nodes created. Took 4172
>> 1000000 nodes created. Took 2859
>> 1100000 nodes created. Took 3625
>> 1200000 nodes created. Took 3235
>> 1300000 nodes created. Took 2781
>> 1400000 nodes created. Took 2891
>> 1500000 nodes created. Took 2922
>> 1600000 nodes created. Took 2968
>> 1700000 nodes created. Took 3438
>> 1800000 nodes created. Took 2687
>> 1900000 nodes created. Took 2969
>> 2000000 nodes created. Took 2891
>> Creating nodes took 61
>> MY_SIZE: 12
>> CompactNodeIndex slot count: 2000000
>> 100000 relationships created. Took 311377
>> 200000 relationships created. Took 11297
>> 300000 relationships created. Took 11062
>> 400000 relationships created. Took 10891
>> 500000 relationships created. Took 11109
>> 600000 relationships created. Took 11375
>> 700000 relationships created. Took 11266
>> 800000 relationships created. Took 26469
>> 900000 relationships created. Took 46875
>> 1000000 relationships created. Took 12047
>> 1100000 relationships created. Took 43016
>> 1200000 relationships created. Took 12110
>> 1300000 relationships created. Took 12625
>> 1400000 relationships created. Took 12031
>> 1500000 relationships created. Took 40375
>> 1600000 relationships created. Took 11328
>> 1700000 relationships created. Took 11125
>> 1800000 relationships created. Took 10891
>> 1900000 relationships created. Took 11266
>> 2000000 relationships created. Took 11125
>> 2100000 relationships created. Took 11281
>> 2200000 relationships created. Took 11156
>> 2300000 relationships created. Took 11250
>> 2400000 relationships created. Took 11735
>> 2500000 relationships created. Took 15984
>> 2600000 relationships created. Took 16766
>> 2700000 relationships created. Took 71969
>> 2800000 relationships created. Took 205283
>> 2900000 relationships created. Took 159236
>> 3000000 relationships created. Took 32734
>> 3100000 relationships created. Took 149064
>> 3200000 relationships created. Took 116391
>> 3300000 relationships created. Took 74079
>> 3400000 relationships created. Took 43360
>> 3500000 relationships created. Took 20500
>> 3600000 relationships created. Took 246704
>> 3700000 relationships created. Took 74407
>> 3800000 relationships created. Took 189611
>> 3900000 relationships created. Took 44922
>> 4000000 relationships created. Took 482675
>> Creating relationships took 2628
>>
>> iMac (REPORT_COUNT = MILLION)
>> Physical mem: 4096MB, Heap size: 2039MB
>> use_memory_mapped_buffers=false
>> neostore.propertystore.db.index.keys.mapped_memory=1M
>> neostore.propertystore.db.strings.mapped_memory=106M
>> neostore.propertystore.db.arrays.mapped_memory=120M
>> neo_store=/Users/paulbandler/Documents/workspace/Neo4jImport/target/hepper/neostore
>> neostore.relationshipstore.db.mapped_memory=152M
>> neostore.propertystore.db.index.mapped_memory=1M
>> neostore.propertystore.db.mapped_memory=124M
>> dump_configuration=true
>> cache_type=weak
>> neostore.nodestore.db.mapped_memory=34M
>> 1000000 nodes created. Took 2817
>> 2000000 nodes created. Took 2407
>> 3000000 nodes created. Took 2086
>> 4000000 nodes created. Took 2303
>> 5000000 nodes created. Took 2912
>> 6000000 nodes created. Took 2178
>> 7000000 nodes created. Took 2241
>> 8000000 nodes created. Took 2453
>> 9000000 nodes created. Took 2627
>> 10000000 nodes created. Took 3996
>> Creating nodes took 26
>> MY_SIZE: 12
>> CompactNodeIndex slot count: 10000000
>> 1000000 relationships created. Took 198784
>> 2000000 relationships created. Took 24203
>> 3000000 relationships created. Took 25313
>> 4000000 relationships created. Took 22177
>> 5000000 relationships created. Took 22406
>> 6000000 relationships created. Took 84977
>> 7000000 relationships created. Took 402123
>> 8000000 relationships created. Took 1342290
>>
>>
>> On 10 Jun 2011, at 08:27, Michael Hunger wrote:
>>
>>> You're right, the lucene-based import shouldn't fail for memory problems;
>>> I will look into that.
>>>
>>> My suggestion is valid if you want to use an in-memory map to speed up
>>> the import. And if you're able to perhaps analyze / partition your data,
>>> that might be a viable solution.
>>>
>>> Will get back to you with the findings later.
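[Editor's note: on the question of which parameters might relieve the relationship-phase I/O bottleneck - the keys in the configuration dumps above can be supplied to the BatchInserter via a properties file. The values below are purely illustrative guesses for a machine with ~4G of RAM, not settings tested in this thread; the idea is to shift mapped memory toward the relationship store, which is where the import becomes I/O bound.]

```properties
# Hypothetical batch-inserter tuning; keys taken from the dumps above,
# values NOT from this thread - adjust to your machine.
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=1000M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=10M
cache_type=weak
```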
>>>
>>> Michael
>>>
>>> Am 10.06.2011 um 09:02 schrieb Paul Bandler:
>>>
>>>> On 9 Jun 2011, at 22:12, Michael Hunger wrote:
>>>>
>>>>> Please keep in mind that the HashMap of 10M strings -> longs will take
>>>>> a substantial amount of heap memory. That's not the fault of Neo4j :)
>>>>> On my system it alone takes 1.8 G of memory (distributed across the
>>>>> strings, the hashmap entries and the longs).
>>>>
>>>> Fair enough, but removing the Map, using the Index instead and setting
>>>> cache_type to weak makes almost no difference to the program's behaviour
>>>> in terms of progressively consuming the heap until it fails. I did this,
>>>> including removing the allocation of the Map, and watched the heap
>>>> consumption follow a similar pattern until it failed, as below.
>>>>
>>>>> Or you should perhaps use an amazon ec2 instance which you can easily
>>>>> get with up to 68 G of RAM :)
>>>>
>>>> With respect, and while I notice the smile, throwing memory at it is not
>>>> an option for the large set of enterprise applications that might
>>>> actually be willing to pay to use Neo4j if it didn't fail at the first
>>>> hurdle when confronted with a trivial and small-scale data load...
>>>>
>>>> runImport failed after 2,072 seconds....
>>>>
>>>> Creating data took 316 seconds
>>>> Physical mem: 1535MB, Heap size: 1016MB
>>>> use_memory_mapped_buffers=false
>>>> neostore.propertystore.db.index.keys.mapped_memory=1M
>>>> neostore.propertystore.db.strings.mapped_memory=52M
>>>> neostore.propertystore.db.arrays.mapped_memory=60M
>>>> neo_store=N:\TradeModel\target\hepper\neostore
>>>> neostore.relationshipstore.db.mapped_memory=76M
>>>> neostore.propertystore.db.index.mapped_memory=1M
>>>> neostore.propertystore.db.mapped_memory=62M
>>>> dump_configuration=true
>>>> cache_type=weak
>>>> neostore.nodestore.db.mapped_memory=17M
>>>> 1000000 nodes created. Took 59906
>>>> 2000000 nodes created. Took 64546
>>>> 3000000 nodes created. Took 74577
>>>> 4000000 nodes created. Took 82607
>>>> 5000000 nodes created. Took 171091
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>>>>     at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>     at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>     at java.lang.Thread.run(Unknown Source)
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>>>>     at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>     at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>     at java.lang.Thread.run(Unknown Source)
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>>>>     at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>     at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>     at java.lang.Thread.run(Unknown Source)
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>>>>     at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>     at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>     at java.lang.Thread.run(Unknown Source)
>>>>
>>>>> So 3 GB of heap are sensible to run this; that leaves about 1G for
>>>>> neo4j + its caches.
>>>>>
>>>>> Of course you're free to shard your map (e.g. by first letter of the
>>>>> name) and persist those maps to disk and reload them if needed. But
>>>>> that's an application-level concern.
>>>>> If you are really limited that way wrt memory, you should try Chris
>>>>> Gioran's implementation, which will take care of that. Or you should
>>>>> perhaps use an amazon ec2 instance which you can easily get with up to
>>>>> 68 G of RAM :)
>>>>>
>>>>> Cheers
>>>>>
>>>>> Michael
>>>>>
>>>>> P.S. As a side-note, for the rest of the memory: have you tried to use
>>>>> the weak reference cache instead of the default soft one? In your
>>>>> config.properties add
>>>>> cache_type = weak
>>>>> That should take care of your memory problems (and the stopping, which
>>>>> is actually the GC trying to reclaim memory).
>>>>>
>>>>> Am 09.06.2011 um 22:36 schrieb Paul Bandler:
>>>>>
>>>>>> I ran Michael's example test import program with the Map replacing the
>>>>>> index on my more modestly configured machine, to see whether the
>>>>>> import scaling problems I have reported previously using BatchInserter
>>>>>> were reproduced. They were - I gave the program 1G of heap and watched
>>>>>> it run using jconsole.
>>>>>> It ran reasonably quickly, consuming the heap in an almost straight
>>>>>> line until it neared capacity, then practically stopped for about 20
>>>>>> minutes, after which it died with an out-of-memory error - see below.
>>>>>>
>>>>>> Now I'm not saying that Neo4j should necessarily go out of its way to
>>>>>> support very memory-constrained environments, but I do think it is not
>>>>>> unreasonable to expect its batch import mechanism not to fall over in
>>>>>> this way, but rather to flush its buffers or whatever, without
>>>>>> requiring the import application writer to shut it down and restart it
>>>>>> periodically...
>>>>>>
>>>>>> Creating data took 331 seconds
>>>>>> 1000000 nodes created. Took 29001
>>>>>> 2000000 nodes created. Took 35107
>>>>>> 3000000 nodes created. Took 35904
>>>>>> 4000000 nodes created. Took 66169
>>>>>> 5000000 nodes created. Took 63280
>>>>>> 6000000 nodes created. Took 183922
>>>>>> 7000000 nodes created. Took 258276
>>>>>>
>>>>>> com.nomura.smo.rdm.neo4j.restore.Hepper
>>>>>> createData (330.364 seconds)
>>>>>> runImport (1,485 seconds later...)
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>     at java.util.ArrayList.<init>(Unknown Source)
>>>>>>     at java.util.ArrayList.<init>(Unknown Source)
>>>>>>     at org.neo4j.kernel.impl.nioneo.store.PropertyRecord.<init>(PropertyRecord.java:33)
>>>>>>     at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createPropertyChain(BatchInserterImpl.java:425)
>>>>>>     at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createNode(BatchInserterImpl.java:143)
>>>>>>     at com.nomura.smo.rdm.neo4j.restore.Hepper.runImport(Hepper.java:61)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>>>>     at java.lang.reflect.Method.invoke(Unknown Source)
>>>>>>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>>>>>>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>>>>>>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>>>>>>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>>>>>>     at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
>>>>>>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
>>>>>>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
>>>>>>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>>>>>>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>>>>>>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>>>>>>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>>>>>>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>>>>>>     at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>>>>>>     at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
>>>>>>
>>>>>> Regards,
>>>>>> Paul Bandler
>>>>>>
>>>>>> On 9 Jun 2011, at 12:27, Michael Hunger wrote:
>>>>>>
>>>>>>> I recreated Daniel's code in Java, mainly because some things were
>>>>>>> missing from his scala example.
>>>>>>>
>>>>>>> You're right that the index is the bottleneck. But with your small
>>>>>>> data set it should be possible to cache the 10M nodes in a heap that
>>>>>>> fits in your machine.
>>>>>>>
>>>>>>> I first ran it with the index and got about 8 seconds / 1M nodes and
>>>>>>> 320 sec / 1M rels.
>>>>>>>
>>>>>>> Then I switched to a 3G heap and a HashMap to keep the name => node
>>>>>>> lookup, and it went to 2 s / 1M nodes and 13 down to 3 sec for 1M
>>>>>>> rels.
>>>>>>>
>>>>>>> That is the approach that Chris takes, only his solution can persist
>>>>>>> the map to disk and is more efficient :)
>>>>>>>
>>>>>>> Hope that helps.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> package org.neo4j.load;
>>>>>>>
>>>>>>> import org.apache.commons.io.FileUtils;
>>>>>>> import org.junit.Test;
>>>>>>> import org.neo4j.graphdb.RelationshipType;
>>>>>>> import org.neo4j.graphdb.index.BatchInserterIndex;
>>>>>>> import org.neo4j.graphdb.index.BatchInserterIndexProvider;
>>>>>>> import org.neo4j.helpers.collection.MapUtil;
>>>>>>> import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
>>>>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserter;
>>>>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
>>>>>>>
>>>>>>> import java.io.*;
>>>>>>> import java.util.HashMap;
>>>>>>> import java.util.Map;
>>>>>>> import java.util.Random;
>>>>>>>
>>>>>>> /**
>>>>>>>  * @author mh
>>>>>>>  * @since 09.06.11
>>>>>>>  */
>>>>>>> public class Hepper {
>>>>>>>
>>>>>>>     public static final int REPORT_COUNT = Config.MILLION;
>>>>>>>
>>>>>>>     enum MyRelationshipTypes implements RelationshipType {
>>>>>>>         BELONGS_TO
>>>>>>>     }
>>>>>>>
>>>>>>>     public static final int COUNT = Config.MILLION * 10;
>>>>>>>
>>>>>>>     @Test
>>>>>>>     public void createData() throws IOException {
>>>>>>>         long time = System.currentTimeMillis();
>>>>>>>         final PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter("data.txt")));
>>>>>>>         Random r = new Random(-1L);
>>>>>>>         for (int nodes = 0; nodes < COUNT; nodes++) {
>>>>>>>             writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT), r.nextInt(COUNT));
>>>>>>>         }
>>>>>>>         writer.close();
>>>>>>>         System.out.println("Creating data took " + (System.currentTimeMillis() - time) / 1000 + " seconds");
>>>>>>>     }
>>>>>>>
>>>>>>>     @Test
>>>>>>>     public void runImport() throws IOException {
>>>>>>>         Map<String, Long> cache = new HashMap<String, Long>(COUNT);
>>>>>>>         final File storeDir = new File("target/hepper");
>>>>>>>         FileUtils.deleteDirectory(storeDir);
>>>>>>>         BatchInserter inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
>>>>>>>         final BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter);
>>>>>>>         final BatchInserterIndex index = indexProvider.nodeIndex("pages", MapUtil.stringMap("type", "exact"));
>>>>>>>         BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
>>>>>>>         String line = null;
>>>>>>>         int nodes = 0;
>>>>>>>         long time = System.currentTimeMillis();
>>>>>>>         long batchTime = time;
>>>>>>>         while ((line = reader.readLine()) != null) {
>>>>>>>             final String[] nodeNames = line.split("\\|");
>>>>>>>             final String name = nodeNames[0];
>>>>>>>             final Map<String, Object> props = MapUtil.map("name", name);
>>>>>>>             final long node = inserter.createNode(props);
>>>>>>>             // index.add(node, props);
>>>>>>>             cache.put(name, node);
>>>>>>>             nodes++;
>>>>>>>             if ((nodes % REPORT_COUNT) == 0) {
>>>>>>>                 System.out.printf("%d nodes created. Took %d %n", nodes, (System.currentTimeMillis() - batchTime));
>>>>>>>                 batchTime = System.currentTimeMillis();
>>>>>>>             }
>>>>>>>         }
>>>>>>>
>>>>>>>         System.out.println("Creating nodes took " + (System.currentTimeMillis() - time) / 1000);
>>>>>>>         index.flush();
>>>>>>>         reader.close();
>>>>>>>         reader = new BufferedReader(new FileReader("data.txt"));
>>>>>>>         int rels = 0;
>>>>>>>         time = System.currentTimeMillis();
>>>>>>>         batchTime = time;
>>>>>>>         while ((line = reader.readLine()) != null) {
>>>>>>>             final String[] nodeNames = line.split("\\|");
>>>>>>>             final String name = nodeNames[0];
>>>>>>>             // final Long from = index.get("name", name).getSingle();
>>>>>>>             Long from = cache.get(name);
>>>>>>>             for (int j = 1; j < nodeNames.length; j++) {
>>>>>>>                 // final Long to = index.get("name", nodeNames[j]).getSingle();
>>>>>>>                 final Long to = cache.get(nodeNames[j]); // [fixed: was cache.get(name), which looked up the start node again]
>>>>>>>                 inserter.createRelationship(from, to, MyRelationshipTypes.BELONGS_TO, null);
>>>>>>>             }
>>>>>>>             rels++;
>>>>>>>             if ((rels % REPORT_COUNT) == 0) {
>>>>>>>                 System.out.printf("%d relationships created. Took %d %n", rels, (System.currentTimeMillis() - batchTime));
>>>>>>>                 batchTime = System.currentTimeMillis();
>>>>>>>             }
>>>>>>>         }
>>>>>>>         System.out.println("Creating relationships took " + (System.currentTimeMillis() - time) / 1000);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> 1000000 nodes created. Took 2227
>>>>>>> 2000000 nodes created. Took 1930
>>>>>>> 3000000 nodes created. Took 1818
>>>>>>> 4000000 nodes created. Took 1966
>>>>>>> 5000000 nodes created. Took 1857
>>>>>>> 6000000 nodes created. Took 2009
>>>>>>> 7000000 nodes created. Took 2068
>>>>>>> 8000000 nodes created. Took 1991
>>>>>>> 9000000 nodes created. Took 2151
>>>>>>> 10000000 nodes created. Took 2276
>>>>>>> Creating nodes took 20
>>>>>>> 1000000 relationships created. Took 13441
>>>>>>> 2000000 relationships created. Took 12887
>>>>>>> 3000000 relationships created. Took 12922
>>>>>>> 4000000 relationships created. Took 13149
>>>>>>> 5000000 relationships created. Took 14177
>>>>>>> 6000000 relationships created. Took 3377
>>>>>>> 7000000 relationships created. Took 2932
>>>>>>> 8000000 relationships created. Took 2991
>>>>>>> 9000000 relationships created. Took 2992
>>>>>>> 10000000 relationships created. Took 2912
>>>>>>> Creating relationships took 81
>>>>>>>
>>>>>>> Am 09.06.2011 um 12:51 schrieb Chris Gioran:
>>>>>>>
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> I am currently working on a tool for importing big data sets into
>>>>>>>> Neo4j graphs. The main problem in such operations is that the usual
>>>>>>>> index implementations are just too slow for retrieving the mapping
>>>>>>>> from keys to created node ids, so a custom solution is needed that
>>>>>>>> depends to a varying degree on the distribution of values in the
>>>>>>>> input set.
>>>>>>>>
>>>>>>>> While your dataset is smaller than the data sizes I deal with, I
>>>>>>>> would like to use it as a test case. If you could somehow provide
>>>>>>>> the actual data, or something that emulates them, I would be
>>>>>>>> grateful.
>>>>>>>>
>>>>>>>> If you want to see my approach, it is available here:
>>>>>>>>
>>>>>>>> https://github.com/digitalstain/BigDataImport
>>>>>>>>
>>>>>>>> The core algorithm is an XJoin-style two-level hashing scheme with
>>>>>>>> adaptable eviction strategies, but it is not production-ready yet,
>>>>>>>> mainly from an API perspective.
>>>>>>>>
>>>>>>>> You can contact me directly for any details regarding this issue.
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> CG
>>>>>>>>
>>>>>>>> On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper
>>>>>>>> <daniel.hep...@gmail.com> wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I'm struggling with importing a graph with about 10m nodes and 20m
>>>>>>>>> relationships, with nodes having 0 to 10 relationships. Creating
>>>>>>>>> the nodes takes about 10 minutes, but creating the relationships
>>>>>>>>> is slower by several orders of magnitude. I'm using a 2.4 GHz i7
>>>>>>>>> MacBook Pro with 4GB RAM and a conventional HDD.
>>>>>>>>>
>>>>>>>>> The graph is stored as an adjacency list in a text file where each
>>>>>>>>> line has this form:
>>>>>>>>>
>>>>>>>>> Foo|Bar|Baz
>>>>>>>>> (Node Foo has relations to Bar and Baz)
>>>>>>>>>
>>>>>>>>> My current approach is to iterate over the whole file twice. In
>>>>>>>>> the first run, I create a node with the property "name" for the
>>>>>>>>> first entry in the line (Foo in this case) and add it to an index.
>>>>>>>>> In the second run, I get the start node and the end nodes from the
>>>>>>>>> index by name and create the relationships.
>>>>>>>>>
>>>>>>>>> My code can be found here: http://pastie.org/2041801
>>>>>>>>>
>>>>>>>>> With my approach, the best I can achieve is 100 created
>>>>>>>>> relationships per second.
>>>>>>>>> I experimented with mapped memory settings, but without much
>>>>>>>>> effect. Is this the speed I can expect?
>>>>>>>>> Any advice on how to speed up this process?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Daniel Hepper

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
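[Editor's note: the adjacency-list format Daniel describes ("Foo|Bar|Baz": first field is the start node, remaining fields its neighbours) is the parsing step at the heart of both import passes. A minimal illustrative sketch - the class name is invented, not from the thread:]

```java
import java.util.Arrays;

// Parses one adjacency-list line of the form "Foo|Bar|Baz".
class AdjacencyLine {
    final String name;         // start node name (first field)
    final String[] neighbours; // zero or more related node names

    AdjacencyLine(String line) {
        // split on the literal '|' separator (escaped, as '|' is a regex metacharacter)
        String[] fields = line.split("\\|");
        name = fields[0];
        neighbours = Arrays.copyOfRange(fields, 1, fields.length);
    }
}
```

Note that a line with no neighbours ("Foo") yields an empty neighbour array, so the relationship pass naturally creates nothing for it.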