> can you share your test and the CompactIndex you wrote?
>
> That would be great.
See below...

> Also the memory settings (Xmx) you used for the different runs.

The heap size is displayed by neo4j, is it not, with console entries such as:

>> Physical mem: 1535MB, Heap size: 1016MB

So that one came from -Xmx1024M, and

>> Physical mem: 4096MB, Heap size: 2039MB

came from -Xmx2048M.

regards,
Paul

package com.xxx.neo4j.restore;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class NodeIdPair implements Comparable<NodeIdPair> {

    private long _node;
    private int _id;

    static final NodeIdPair _prototype = new NodeIdPair(Long.MAX_VALUE, Integer.MAX_VALUE);
    static Integer MY_SIZE = null;

    public static int size() {
        if (MY_SIZE == null) {
            MY_SIZE = (new NodeIdPair(Long.MAX_VALUE, Integer.MAX_VALUE)).toByteArray().length;
            System.out.println("MY_SIZE: " + MY_SIZE);
        }
        return MY_SIZE;
    }

    public NodeIdPair(long node, int id) {
        _node = node;
        _id = id;
    }

    public NodeIdPair(byte fromByteArray[]) {
        ByteArrayInputStream bais = new ByteArrayInputStream(fromByteArray);
        DataInputStream dis = new DataInputStream(bais);
        try {
            _node = dis.readLong();
            _id = dis.readInt();
        } catch (Exception e) {
            throw new Error("Unexpected exception. byte[] len " + fromByteArray.length, e);
        }
    }

    byte[] toByteArray() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream(MY_SIZE != null ? MY_SIZE : 12);
        DataOutputStream dos = new DataOutputStream(bos);
        try {
            dos.writeLong(_node);
            dos.writeInt(_id);
            dos.flush();
        } catch (Exception e) {
            throw new Error("Unexpected exception: ", e);
        }
        return bos.toByteArray();
    }

    @Override
    public int compareTo(NodeIdPair arg0) {
        return _id - arg0._id;
    }

    public long getNode() {
        return _node;
    }

    public int getId() {
        return _id;
    }
}

package com.xxx.neo4j.restore;

import java.util.Arrays;
import java.util.TreeSet;

public class CompactNodeIndex {

    private int _offSet = 0;
    private byte _extent[];
    private int _slotCount;

    public CompactNodeIndex(TreeSet<NodeIdPair> sortedPairs) {
        _extent = new byte[sortedPairs.size() * NodeIdPair.size()];
        _slotCount = sortedPairs.size();
        for (NodeIdPair pair : sortedPairs) {
            byte pairBytes[] = pair.toByteArray();
            copyToExtent(pairBytes);
        }
        System.out.println("CompactNodeIndex slot count: " + _slotCount);
    }

    public NodeIdPair findNodeForId(int id) {
        return search(id, 0, _slotCount - 1);
    }

    @SuppressWarnings("serial")
    static class FoundIt extends Exception {
        NodeIdPair _result;

        FoundIt(NodeIdPair result) {
            _result = result;
        }
    }

    private NodeIdPair search(int soughtId, int lowerBound, int upperBound) {
        try {
            while (true) {
                if ((upperBound - lowerBound) > 1) {
                    int compareSlot = lowerBound + ((upperBound - lowerBound) / 2);
                    int comparison = compareAt(soughtId, compareSlot);
                    if (comparison > 0) {
                        lowerBound = compareSlot;
                        continue;
                    } else {
                        upperBound = compareSlot;
                    }
                } else {
                    compareAt(soughtId, upperBound);
                    compareAt(soughtId, lowerBound);
                    // not found
                    return null;
                }
            }
        } catch (FoundIt result) {
            return result._result;
        }
    }

    private int compareAt(int soughtId, int compareSlot) throws FoundIt {
        NodeIdPair candidate = get(compareSlot);
        int diff = soughtId - candidate.getId();
        if (diff == 0)
            throw new FoundIt(candidate);
        return diff;
    }

    private NodeIdPair get(int compareSlot) {
        int startPos = compareSlot * NodeIdPair.size();
        byte serialisedPair[] = Arrays.copyOfRange(_extent, startPos, startPos + NodeIdPair.size());
        return new NodeIdPair(serialisedPair);
    }

    private void copyToExtent(byte[] pairBytes) {
        for (byte b : pairBytes) {
            if (_offSet >= _extent.length) // was '>', which would have thrown ArrayIndexOutOfBoundsException before the Error
                throw new Error("Unexpected extent overflow: " + _offSet);
            _extent[_offSet++] = b;
        }
    }
}

On 13 Jun 2011, at 13:23, Michael Hunger wrote:

> Paul,
>
> can you share your test and the CompactIndex you wrote?
>
> That would be great.
>
> Also the memory settings (Xmx) you used for the different runs.
>
> Thanks so much
>
> Michael
>
> Am 13.06.2011 um 14:15 schrieb Paul Bandler:
>
>> Having noticed a mention in the 1.4M04 release notes that:
>>
>>> Also, the BatchInserterIndex now keeps its memory usage in-check with
>>> batched commits of indexed data using a configurable batch commit size.
>>
>> I re-ran this test using M04 and sure enough, node creation no longer eats
>> up the heap linearly, so that is good - I should be able to remove the
>> periodic resetting of the BatchInserter during import.
>>
>> So I returned to the issue of removing the index creation and later access
>> bottleneck using an application-managed data structure, as Michael
>> illustrated. Needing a solution with a smaller memory footprint, I wrote a
>> CompactNodeIndex class for mapping integer 'id' key values to long nodes
>> that uses a minimal memory footprint by overlaying a binary-choppable
>> table onto a byte array. Watching the heap in jconsole while this ran, I
>> could see it had the desired effect of releasing huge amounts of heap once
>> the CompactNodeIndex is loaded and the source data structure gc'd. However,
>> when I attempted to scale the test program back up to the 10M nodes Michael
>> had been testing, it ran into something of a brick wall, becoming massively
>> I/O bound when creating the relationships. With 1M nodes it ran ok, with 2M
>> nodes not too bad, but much beyond that it crawls along using just about 1%
>> of CPU while having loads of heap spare.
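[Editor's note: the same id -> node mapping that CompactNodeIndex packs into a byte array can be sketched more simply with two sorted parallel arrays and java.util.Arrays.binarySearch. This is an illustrative alternative, not Paul's class; the class name is invented. Memory cost is the same 12 bytes per entry, and lookups avoid deserialising a NodeIdPair object per probe.]

```java
import java.util.Arrays;

// Sketch: int-id -> long-node lookup over two parallel arrays sorted by id.
class ParallelArrayNodeIndex {
    private final int[] ids;     // ids, sorted ascending
    private final long[] nodes;  // nodes[i] is the node for ids[i]

    ParallelArrayNodeIndex(int[] sortedIds, long[] nodesInSameOrder) {
        this.ids = sortedIds;
        this.nodes = nodesInSameOrder;
    }

    /** Returns the node for the given id, or -1 if the id is absent. */
    long findNodeForId(int id) {
        int slot = Arrays.binarySearch(ids, id);
        // binarySearch returns a negative (-(insertionPoint) - 1) on a miss
        return slot >= 0 ? nodes[slot] : -1L;
    }
}
```

Using a sentinel return instead of the FoundIt exception also sidesteps the cost of filling in a stack trace on every successful lookup.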
>>
>> I re-ran on a more generously configured iMac (giving the test 4G of heap)
>> and it did much better in that it actually showed some progress building
>> relationships over a 10M node-set, but it still exhibited a massive
>> slowdown once past 7M relationships.
>>
>> Below are the test results - the question now is whether there are any
>> Neo4j parameters that might relieve this I/O bottleneck that appears when
>> building relationships over such-sized node-sets with the BatchInserter...?
>> I note the section in the manual on performance parameters, but not being
>> familiar enough with the Neo4j internals, I'm afraid I don't feel they give
>> enough clear information on how to set them to improve the performance of
>> this use-case.
>>
>> Thanks,
>>
>> Paul
>>
>> Run 1 - Windows m/c (REPORT_COUNT = MILLION/10):
>> Physical mem: 1535MB, Heap size: 1016MB
>> use_memory_mapped_buffers=false
>> neostore.propertystore.db.index.keys.mapped_memory=1M
>> neostore.propertystore.db.strings.mapped_memory=52M
>> neostore.propertystore.db.arrays.mapped_memory=60M
>> neo_store=N:\TradeModel\target\hepper\neostore
>> neostore.relationshipstore.db.mapped_memory=76M
>> neostore.propertystore.db.index.mapped_memory=1M
>> neostore.propertystore.db.mapped_memory=62M
>> dump_configuration=true
>> cache_type=weak
>> neostore.nodestore.db.mapped_memory=17M
>> 100000 nodes created. Took 2906
>> 200000 nodes created. Took 2688
>> 300000 nodes created. Took 2828
>> 400000 nodes created. Took 2953
>> 500000 nodes created. Took 2672
>> 600000 nodes created. Took 2766
>> 700000 nodes created. Took 2687
>> 800000 nodes created. Took 2703
>> 900000 nodes created. Took 2719
>> 1000000 nodes created. Took 2641
>> Creating nodes took 27
>> MY_SIZE: 12
>> CompactNodeIndex slot count: 1000000
>> 100000 relationships created. Took 4125
>> 200000 relationships created. Took 3953
>> 300000 relationships created. Took 3937
>> 400000 relationships created. Took 3610
>> 500000 relationships created. Took 3719
>> 600000 relationships created. Took 4328
>> 700000 relationships created. Took 3750
>> 800000 relationships created. Took 3609
>> 900000 relationships created. Took 4125
>> 1000000 relationships created. Took 3781
>> 1100000 relationships created. Took 4125
>> 1200000 relationships created. Took 3750
>> 1300000 relationships created. Took 3907
>> 1400000 relationships created. Took 4297
>> 1500000 relationships created. Took 3703
>> 1600000 relationships created. Took 3687
>> 1700000 relationships created. Took 4328
>> 1800000 relationships created. Took 3907
>> 1900000 relationships created. Took 3718
>> 2000000 relationships created. Took 3891
>> Creating relationships took 78
>>
>> 2M Nodes on Windows m/c:-
>>
>> Creating data took 68 seconds
>> Physical mem: 1535MB, Heap size: 1016MB
>> use_memory_mapped_buffers=false
>> neostore.propertystore.db.index.keys.mapped_memory=1M
>> neostore.propertystore.db.strings.mapped_memory=52M
>> neostore.propertystore.db.arrays.mapped_memory=60M
>> neo_store=N:\TradeModel\target\hepper\neostore
>> neostore.relationshipstore.db.mapped_memory=76M
>> neostore.propertystore.db.index.mapped_memory=1M
>> neostore.propertystore.db.mapped_memory=62M
>> dump_configuration=true
>> cache_type=weak
>> neostore.nodestore.db.mapped_memory=17M
>> 100000 nodes created. Took 3188
>> 200000 nodes created. Took 3094
>> 300000 nodes created. Took 3062
>> 400000 nodes created. Took 2813
>> 500000 nodes created. Took 2718
>> 600000 nodes created. Took 3000
>> 700000 nodes created. Took 2938
>> 800000 nodes created. Took 2828
>> 900000 nodes created. Took 4172
>> 1000000 nodes created. Took 2859
>> 1100000 nodes created. Took 3625
>> 1200000 nodes created. Took 3235
>> 1300000 nodes created. Took 2781
>> 1400000 nodes created. Took 2891
>> 1500000 nodes created. Took 2922
>> 1600000 nodes created. Took 2968
>> 1700000 nodes created. Took 3438
>> 1800000 nodes created. Took 2687
>> 1900000 nodes created. Took 2969
>> 2000000 nodes created. Took 2891
>> Creating nodes took 61
>> MY_SIZE: 12
>> CompactNodeIndex slot count: 2000000
>> 100000 relationships created. Took 311377
>> 200000 relationships created. Took 11297
>> 300000 relationships created. Took 11062
>> 400000 relationships created. Took 10891
>> 500000 relationships created. Took 11109
>> 600000 relationships created. Took 11375
>> 700000 relationships created. Took 11266
>> 800000 relationships created. Took 26469
>> 900000 relationships created. Took 46875
>> 1000000 relationships created. Took 12047
>> 1100000 relationships created. Took 43016
>> 1200000 relationships created. Took 12110
>> 1300000 relationships created. Took 12625
>> 1400000 relationships created. Took 12031
>> 1500000 relationships created. Took 40375
>> 1600000 relationships created. Took 11328
>> 1700000 relationships created. Took 11125
>> 1800000 relationships created. Took 10891
>> 1900000 relationships created. Took 11266
>> 2000000 relationships created. Took 11125
>> 2100000 relationships created. Took 11281
>> 2200000 relationships created. Took 11156
>> 2300000 relationships created. Took 11250
>> 2400000 relationships created. Took 11735
>> 2500000 relationships created. Took 15984
>> 2600000 relationships created. Took 16766
>> 2700000 relationships created. Took 71969
>> 2800000 relationships created. Took 205283
>> 2900000 relationships created. Took 159236
>> 3000000 relationships created. Took 32734
>> 3100000 relationships created. Took 149064
>> 3200000 relationships created. Took 116391
>> 3300000 relationships created. Took 74079
>> 3400000 relationships created. Took 43360
>> 3500000 relationships created. Took 20500
>> 3600000 relationships created. Took 246704
>> 3700000 relationships created. Took 74407
>> 3800000 relationships created. Took 189611
>> 3900000 relationships created. Took 44922
>> 4000000 relationships created. Took 482675
>> Creating relationships took 2628
>>
>> iMac (REPORT_COUNT = MILLION)
>> Physical mem: 4096MB, Heap size: 2039MB
>> use_memory_mapped_buffers=false
>> neostore.propertystore.db.index.keys.mapped_memory=1M
>> neostore.propertystore.db.strings.mapped_memory=106M
>> neostore.propertystore.db.arrays.mapped_memory=120M
>> neo_store=/Users/paulbandler/Documents/workspace/Neo4jImport/target/hepper/neostore
>> neostore.relationshipstore.db.mapped_memory=152M
>> neostore.propertystore.db.index.mapped_memory=1M
>> neostore.propertystore.db.mapped_memory=124M
>> dump_configuration=true
>> cache_type=weak
>> neostore.nodestore.db.mapped_memory=34M
>> 1000000 nodes created. Took 2817
>> 2000000 nodes created. Took 2407
>> 3000000 nodes created. Took 2086
>> 4000000 nodes created. Took 2303
>> 5000000 nodes created. Took 2912
>> 6000000 nodes created. Took 2178
>> 7000000 nodes created. Took 2241
>> 8000000 nodes created. Took 2453
>> 9000000 nodes created. Took 2627
>> 10000000 nodes created. Took 3996
>> Creating nodes took 26
>> MY_SIZE: 12
>> CompactNodeIndex slot count: 10000000
>> 1000000 relationships created. Took 198784
>> 2000000 relationships created. Took 24203
>> 3000000 relationships created. Took 25313
>> 4000000 relationships created. Took 22177
>> 5000000 relationships created. Took 22406
>> 6000000 relationships created. Took 84977
>> 7000000 relationships created. Took 402123
>> 8000000 relationships created. Took 1342290
>>
>>
>> On 10 Jun 2011, at 08:27, Michael Hunger wrote:
>>
>>> You're right, the lucene-based import shouldn't fail for memory problems;
>>> I will look into that.
>>>
>>> My suggestion is valid if you want to use an in-memory map to speed up
>>> the import. And if you're able to perhaps analyze / partition your data,
>>> that might be a viable solution.
>>>
>>> Will get back to you with the findings later.
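[Editor's note: on the question of which parameters might relieve the relationship-phase I/O bottleneck - the keys in the configuration dumps above can be supplied to the BatchInserter via a properties file. The values below are purely illustrative guesses for a machine with ~4G of RAM, not settings tested in this thread; the idea is to shift mapped memory toward the relationship store, which is where the import becomes I/O bound.]

```properties
# Hypothetical batch-inserter tuning; keys taken from the dumps above,
# values NOT from this thread - adjust to your machine.
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=1000M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=10M
cache_type=weak
```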
>>>
>>> Michael
>>>
>>> Am 10.06.2011 um 09:02 schrieb Paul Bandler:
>>>
>>>> On 9 Jun 2011, at 22:12, Michael Hunger wrote:
>>>>
>>>>> Please keep in mind that the HashMap of 10M strings -> longs will take
>>>>> a substantial amount of heap memory. That's not the fault of Neo4j :)
>>>>> On my system it alone takes 1.8 G of memory (distributed across the
>>>>> strings, the hashmap entries and the longs).
>>>>
>>>> Fair enough, but removing the Map, using the Index instead and setting
>>>> cache_type to weak makes almost no difference to the program's behaviour
>>>> in terms of progressively consuming the heap until it fails. I did this,
>>>> including removing the allocation of the Map, and watched the heap
>>>> consumption follow a similar pattern until it failed, as below.
>>>>
>>>>> Or you should perhaps use an amazon ec2 instance which you can easily
>>>>> get with up to 68 G of RAM :)
>>>>
>>>> With respect, and while I notice the smile, throwing memory at it is not
>>>> an option for the large set of enterprise applications that might
>>>> actually be willing to pay to use Neo4j if it didn't fail at the first
>>>> hurdle when confronted with a trivial and small-scale data load...
>>>>
>>>> runImport failed after 2,072 seconds....
>>>>
>>>> Creating data took 316 seconds
>>>> Physical mem: 1535MB, Heap size: 1016MB
>>>> use_memory_mapped_buffers=false
>>>> neostore.propertystore.db.index.keys.mapped_memory=1M
>>>> neostore.propertystore.db.strings.mapped_memory=52M
>>>> neostore.propertystore.db.arrays.mapped_memory=60M
>>>> neo_store=N:\TradeModel\target\hepper\neostore
>>>> neostore.relationshipstore.db.mapped_memory=76M
>>>> neostore.propertystore.db.index.mapped_memory=1M
>>>> neostore.propertystore.db.mapped_memory=62M
>>>> dump_configuration=true
>>>> cache_type=weak
>>>> neostore.nodestore.db.mapped_memory=17M
>>>> 1000000 nodes created. Took 59906
>>>> 2000000 nodes created. Took 64546
>>>> 3000000 nodes created. Took 74577
>>>> 4000000 nodes created. Took 82607
>>>> 5000000 nodes created. Took 171091
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>>>>     at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>     at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>     at java.lang.Thread.run(Unknown Source)
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>>>>     at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>     at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>     at java.lang.Thread.run(Unknown Source)
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>>>>     at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>     at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>     at java.lang.Thread.run(Unknown Source)
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
>>>>     at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>     at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
>>>>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>     at java.lang.Thread.run(Unknown Source)
>>>>
>>>>> So 3 GB of heap are sensible to run this; that leaves about 1G for
>>>>> neo4j + its caches.
>>>>>
>>>>> Of course you're free to shard your map (e.g. by first letter of the
>>>>> name) and persist those maps to disk and reload them if needed. But
>>>>> that's an application-level concern.
>>>>> If you are really limited that way wrt memory, you should try Chris
>>>>> Gioran's implementation, which will take care of that. Or you should
>>>>> perhaps use an amazon ec2 instance which you can easily get with up to
>>>>> 68 G of RAM :)
>>>>>
>>>>> Cheers
>>>>>
>>>>> Michael
>>>>>
>>>>> P.S. As a side-note, for the rest of the memory: have you tried to use
>>>>> the weak reference cache instead of the default soft one? In your
>>>>> config.properties add
>>>>> cache_type = weak
>>>>> That should take care of your memory problems (and the stopping, which
>>>>> is actually the GC trying to reclaim memory).
>>>>>
>>>>> Am 09.06.2011 um 22:36 schrieb Paul Bandler:
>>>>>
>>>>>> I ran Michael's example test import program with the Map replacing the
>>>>>> index on my more modestly configured machine, to see whether the
>>>>>> import scaling problems I have reported previously using BatchInserter
>>>>>> were reproduced. They were - I gave the program 1G of heap and watched
>>>>>> it run using jconsole.
>>>>>> It ran reasonably quickly, consuming the heap in an almost straight
>>>>>> line until it neared capacity, then practically stopped for about 20
>>>>>> minutes, after which it died with an out-of-memory error - see below.
>>>>>>
>>>>>> Now I'm not saying that Neo4j should necessarily go out of its way to
>>>>>> support very memory-constrained environments, but I do think it is not
>>>>>> unreasonable to expect its batch import mechanism not to fall over in
>>>>>> this way, but rather to flush its buffers or whatever, without
>>>>>> requiring the import application writer to shut it down and restart it
>>>>>> periodically...
>>>>>>
>>>>>> Creating data took 331 seconds
>>>>>> 1000000 nodes created. Took 29001
>>>>>> 2000000 nodes created. Took 35107
>>>>>> 3000000 nodes created. Took 35904
>>>>>> 4000000 nodes created. Took 66169
>>>>>> 5000000 nodes created. Took 63280
>>>>>> 6000000 nodes created. Took 183922
>>>>>> 7000000 nodes created. Took 258276
>>>>>>
>>>>>> com.nomura.smo.rdm.neo4j.restore.Hepper
>>>>>> createData (330.364 seconds)
>>>>>> runImport (1,485 seconds later...)
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>     at java.util.ArrayList.<init>(Unknown Source)
>>>>>>     at java.util.ArrayList.<init>(Unknown Source)
>>>>>>     at org.neo4j.kernel.impl.nioneo.store.PropertyRecord.<init>(PropertyRecord.java:33)
>>>>>>     at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createPropertyChain(BatchInserterImpl.java:425)
>>>>>>     at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createNode(BatchInserterImpl.java:143)
>>>>>>     at com.nomura.smo.rdm.neo4j.restore.Hepper.runImport(Hepper.java:61)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>>>>     at java.lang.reflect.Method.invoke(Unknown Source)
>>>>>>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>>>>>>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>>>>>>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>>>>>>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>>>>>>     at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
>>>>>>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
>>>>>>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
>>>>>>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>>>>>>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>>>>>>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>>>>>>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>>>>>>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>>>>>>     at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>>>>>>     at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>>>>>>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
>>>>>>
>>>>>> Regards,
>>>>>> Paul Bandler
>>>>>>
>>>>>> On 9 Jun 2011, at 12:27, Michael Hunger wrote:
>>>>>>
>>>>>>> I recreated Daniel's code in Java, mainly because some things were
>>>>>>> missing from his scala example.
>>>>>>>
>>>>>>> You're right that the index is the bottleneck. But with your small
>>>>>>> data set it should be possible to cache the 10M nodes in a heap that
>>>>>>> fits in your machine.
>>>>>>>
>>>>>>> I first ran it with the index and got about 8 seconds / 1M nodes and
>>>>>>> 320 sec / 1M rels.
>>>>>>>
>>>>>>> Then I switched to a 3G heap and a HashMap to keep the name => node
>>>>>>> lookup, and it went to 2 s / 1M nodes and 13 down to 3 sec for 1M
>>>>>>> rels.
>>>>>>>
>>>>>>> That is the approach that Chris takes, only his solution can persist
>>>>>>> the map to disk and is more efficient :)
>>>>>>>
>>>>>>> Hope that helps.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> package org.neo4j.load;
>>>>>>>
>>>>>>> import org.apache.commons.io.FileUtils;
>>>>>>> import org.junit.Test;
>>>>>>> import org.neo4j.graphdb.RelationshipType;
>>>>>>> import org.neo4j.graphdb.index.BatchInserterIndex;
>>>>>>> import org.neo4j.graphdb.index.BatchInserterIndexProvider;
>>>>>>> import org.neo4j.helpers.collection.MapUtil;
>>>>>>> import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
>>>>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserter;
>>>>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
>>>>>>>
>>>>>>> import java.io.*;
>>>>>>> import java.util.HashMap;
>>>>>>> import java.util.Map;
>>>>>>> import java.util.Random;
>>>>>>>
>>>>>>> /**
>>>>>>>  * @author mh
>>>>>>>  * @since 09.06.11
>>>>>>>  */
>>>>>>> public class Hepper {
>>>>>>>
>>>>>>>     public static final int REPORT_COUNT = Config.MILLION;
>>>>>>>
>>>>>>>     enum MyRelationshipTypes implements RelationshipType {
>>>>>>>         BELONGS_TO
>>>>>>>     }
>>>>>>>
>>>>>>>     public static final int COUNT = Config.MILLION * 10;
>>>>>>>
>>>>>>>     @Test
>>>>>>>     public void createData() throws IOException {
>>>>>>>         long time = System.currentTimeMillis();
>>>>>>>         final PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter("data.txt")));
>>>>>>>         Random r = new Random(-1L);
>>>>>>>         for (int nodes = 0; nodes < COUNT; nodes++) {
>>>>>>>             writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT), r.nextInt(COUNT));
>>>>>>>         }
>>>>>>>         writer.close();
>>>>>>>         System.out.println("Creating data took " + (System.currentTimeMillis() - time) / 1000 + " seconds");
>>>>>>>     }
>>>>>>>
>>>>>>>     @Test
>>>>>>>     public void runImport() throws IOException {
>>>>>>>         Map<String, Long> cache = new HashMap<String, Long>(COUNT);
>>>>>>>         final File storeDir = new File("target/hepper");
>>>>>>>         FileUtils.deleteDirectory(storeDir);
>>>>>>>         BatchInserter inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
>>>>>>>         final BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter);
>>>>>>>         final BatchInserterIndex index = indexProvider.nodeIndex("pages", MapUtil.stringMap("type", "exact"));
>>>>>>>         BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
>>>>>>>         String line = null;
>>>>>>>         int nodes = 0;
>>>>>>>         long time = System.currentTimeMillis();
>>>>>>>         long batchTime = time;
>>>>>>>         while ((line = reader.readLine()) != null) {
>>>>>>>             final String[] nodeNames = line.split("\\|");
>>>>>>>             final String name = nodeNames[0];
>>>>>>>             final Map<String, Object> props = MapUtil.map("name", name);
>>>>>>>             final long node = inserter.createNode(props);
>>>>>>>             // index.add(node, props);
>>>>>>>             cache.put(name, node);
>>>>>>>             nodes++;
>>>>>>>             if ((nodes % REPORT_COUNT) == 0) {
>>>>>>>                 System.out.printf("%d nodes created. Took %d %n", nodes, (System.currentTimeMillis() - batchTime));
>>>>>>>                 batchTime = System.currentTimeMillis();
>>>>>>>             }
>>>>>>>         }
>>>>>>>
>>>>>>>         System.out.println("Creating nodes took " + (System.currentTimeMillis() - time) / 1000);
>>>>>>>         index.flush();
>>>>>>>         reader.close();
>>>>>>>         reader = new BufferedReader(new FileReader("data.txt"));
>>>>>>>         int rels = 0;
>>>>>>>         time = System.currentTimeMillis();
>>>>>>>         batchTime = time;
>>>>>>>         while ((line = reader.readLine()) != null) {
>>>>>>>             final String[] nodeNames = line.split("\\|");
>>>>>>>             final String name = nodeNames[0];
>>>>>>>             // final Long from = index.get("name", name).getSingle();
>>>>>>>             Long from = cache.get(name);
>>>>>>>             for (int j = 1; j < nodeNames.length; j++) {
>>>>>>>                 // final Long to = index.get("name", nodeNames[j]).getSingle();
>>>>>>>                 final Long to = cache.get(nodeNames[j]); // [fixed: was cache.get(name), which looked up the start node again]
>>>>>>>                 inserter.createRelationship(from, to, MyRelationshipTypes.BELONGS_TO, null);
>>>>>>>             }
>>>>>>>             rels++;
>>>>>>>             if ((rels % REPORT_COUNT) == 0) {
>>>>>>>                 System.out.printf("%d relationships created. Took %d %n", rels, (System.currentTimeMillis() - batchTime));
>>>>>>>                 batchTime = System.currentTimeMillis();
>>>>>>>             }
>>>>>>>         }
>>>>>>>         System.out.println("Creating relationships took " + (System.currentTimeMillis() - time) / 1000);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> 1000000 nodes created. Took 2227
>>>>>>> 2000000 nodes created. Took 1930
>>>>>>> 3000000 nodes created. Took 1818
>>>>>>> 4000000 nodes created. Took 1966
>>>>>>> 5000000 nodes created. Took 1857
>>>>>>> 6000000 nodes created. Took 2009
>>>>>>> 7000000 nodes created. Took 2068
>>>>>>> 8000000 nodes created. Took 1991
>>>>>>> 9000000 nodes created. Took 2151
>>>>>>> 10000000 nodes created. Took 2276
>>>>>>> Creating nodes took 20
>>>>>>> 1000000 relationships created. Took 13441
>>>>>>> 2000000 relationships created. Took 12887
>>>>>>> 3000000 relationships created. Took 12922
>>>>>>> 4000000 relationships created. Took 13149
>>>>>>> 5000000 relationships created. Took 14177
>>>>>>> 6000000 relationships created. Took 3377
>>>>>>> 7000000 relationships created. Took 2932
>>>>>>> 8000000 relationships created. Took 2991
>>>>>>> 9000000 relationships created. Took 2992
>>>>>>> 10000000 relationships created. Took 2912
>>>>>>> Creating relationships took 81
>>>>>>>
>>>>>>> Am 09.06.2011 um 12:51 schrieb Chris Gioran:
>>>>>>>
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> I am currently working on a tool for importing big data sets into
>>>>>>>> Neo4j graphs. The main problem in such operations is that the usual
>>>>>>>> index implementations are just too slow for retrieving the mapping
>>>>>>>> from keys to created node ids, so a custom solution is needed that
>>>>>>>> depends to a varying degree on the distribution of values in the
>>>>>>>> input set.
>>>>>>>>
>>>>>>>> While your dataset is smaller than the data sizes I deal with, I
>>>>>>>> would like to use it as a test case. If you could somehow provide
>>>>>>>> the actual data, or something that emulates them, I would be
>>>>>>>> grateful.
>>>>>>>>
>>>>>>>> If you want to see my approach, it is available here:
>>>>>>>>
>>>>>>>> https://github.com/digitalstain/BigDataImport
>>>>>>>>
>>>>>>>> The core algorithm is an XJoin-style two-level hashing scheme with
>>>>>>>> adaptable eviction strategies, but it is not production-ready yet,
>>>>>>>> mainly from an API perspective.
>>>>>>>>
>>>>>>>> You can contact me directly for any details regarding this issue.
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> CG
>>>>>>>>
>>>>>>>> On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper
>>>>>>>> <daniel.hep...@gmail.com> wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I'm struggling with importing a graph with about 10m nodes and 20m
>>>>>>>>> relationships, with nodes having 0 to 10 relationships. Creating
>>>>>>>>> the nodes takes about 10 minutes, but creating the relationships
>>>>>>>>> is slower by several orders of magnitude. I'm using a 2.4 GHz i7
>>>>>>>>> MacBook Pro with 4GB RAM and a conventional HDD.
>>>>>>>>>
>>>>>>>>> The graph is stored as an adjacency list in a text file where each
>>>>>>>>> line has this form:
>>>>>>>>>
>>>>>>>>> Foo|Bar|Baz
>>>>>>>>> (Node Foo has relations to Bar and Baz)
>>>>>>>>>
>>>>>>>>> My current approach is to iterate over the whole file twice. In
>>>>>>>>> the first run, I create a node with the property "name" for the
>>>>>>>>> first entry in the line (Foo in this case) and add it to an index.
>>>>>>>>> In the second run, I get the start node and the end nodes from the
>>>>>>>>> index by name and create the relationships.
>>>>>>>>>
>>>>>>>>> My code can be found here: http://pastie.org/2041801
>>>>>>>>>
>>>>>>>>> With my approach, the best I can achieve is 100 created
>>>>>>>>> relationships per second.
>>>>>>>>> I experimented with mapped memory settings, but without much
>>>>>>>>> effect. Is this the speed I can expect?
>>>>>>>>> Any advice on how to speed up this process?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Daniel Hepper

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
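[Editor's note: the adjacency-list format Daniel describes ("Foo|Bar|Baz": first field is the start node, remaining fields its neighbours) is the parsing step at the heart of both import passes. A minimal illustrative sketch - the class name is invented, not from the thread:]

```java
import java.util.Arrays;

// Parses one adjacency-list line of the form "Foo|Bar|Baz".
class AdjacencyLine {
    final String name;         // start node name (first field)
    final String[] neighbours; // zero or more related node names

    AdjacencyLine(String line) {
        // split on the literal '|' separator (escaped, as '|' is a regex metacharacter)
        String[] fields = line.split("\\|");
        name = fields[0];
        neighbours = Arrays.copyOfRange(fields, 1, fields.length);
    }
}
```

Note that a line with no neighbours ("Foo") yields an empty neighbour array, so the relationship pass naturally creates nothing for it.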