Paul,

I tried to use your implementation, but it added quite a high performance penalty: 
when profiling it, it consumed 3/4 of the time for the relationship creation.
The rest of the time was spent acquiring persistence windows for the node-store.

I also profiled the version using the simple map.

We found one issue (IdGenerator) that could benefit from the assumption that 
batch insertion runs single-threaded.

Lots of time was spent acquiring persistence windows for nodes when creating 
the relationships.

My results with no batch-buffer configuration were not so bad initially, but I 
sped them up tremendously by increasing the memory-mapped buffer size for the 
node-store (so that it more often finds the persistence window in the cache and 
doesn't have to look it up).

So, my configuration looks like this:
                "neostore.nodestore.db.mapped_memory","250M",
                "neostore.relationshipstore.db.mapped_memory","200M", // that 
could be also lower 50M-100M
                "neostore.propertystore.db.mapped_memory","50M",
                "neostore.propertystore.db.strings.mapped_memory","50M",
                "neostore.propertystore.db.arrays.mapped_memory","0M"

Then I get the following (with the faster map-cache):
Physical mem: 16384MB, Heap size: 3055MB
use_memory_mapped_buffers=false
neostore.propertystore.db.index.keys.mapped_memory=1M
neostore.propertystore.db.strings.mapped_memory=50M
neostore.propertystore.db.arrays.mapped_memory=0M
neo_store=/Users/mh/java/neo/import/target/hepper/neostore
neostore.relationshipstore.db.mapped_memory=200M
neostore.propertystore.db.index.mapped_memory=1M
neostore.propertystore.db.mapped_memory=50M
dump_configuration=true
cache_type=weak
neostore.nodestore.db.mapped_memory=250M

1000000 nodes created. Took 2358 
2000000 nodes created. Took 2090 
3000000 nodes created. Took 2082 
4000000 nodes created. Took 2054 
5000000 nodes created. Took 2245 
6000000 nodes created. Took 2100 
7000000 nodes created. Took 2117 
8000000 nodes created. Took 2076 
9000000 nodes created. Took 2391 
10000000 nodes created. Took 2214 
Creating nodes took 21
1000000 relationships created. Took 4302 
2000000 relationships created. Took 4176 
3000000 relationships created. Took 4260 
4000000 relationships created. Took 6448 
5000000 relationships created. Took 7645 
6000000 relationships created. Took 7290 
7000000 relationships created. Took 8627 
8000000 relationships created. Took 7907 
9000000 relationships created. Took 8292 
10000000 relationships created. Took 8563 
Creating relationships took 67

>>> use_memory_mapped_buffers=false
>>> neostore.propertystore.db.index.keys.mapped_memory=1M
>>> neostore.propertystore.db.strings.mapped_memory=52M
>>> neostore.propertystore.db.arrays.mapped_memory=60M
>>> neo_store=N:\TradeModel\target\hepper\neostore
>>> neostore.relationshipstore.db.mapped_memory=76M
>>> neostore.propertystore.db.index.mapped_memory=1M
>>> neostore.propertystore.db.mapped_memory=62M
>>> dump_configuration=true
>>> cache_type=weak


Cheers

Michael

On 13.06.2011 at 14:58, Paul Bandler wrote:

>> can you share your test and the CompactIndex you wrote?
>> 
>> That would be great. 
> 
> See below...
> 
>> Also the memory settings (Xmx) you used for the different runs.
> 
> The heap size is displayed by Neo4j, is it not, with console entries such as:
> 
>>> Physical mem: 1535MB, Heap size: 1016MB
> 
> So that one came from -Xmx1024M, and 
> 
>>> Physical mem: 4096MB, Heap size: 2039MB
>> 
> 
> came from -Xms2048M
> 
> 
> regards,
> 
> Paul
> 
> 
> package com.xxx.neo4j.restore;
> 
> import java.io.ByteArrayInputStream;
> import java.io.ByteArrayOutputStream;
> import java.io.DataInputStream;
> import java.io.DataOutputStream;
> 
> public class NodeIdPair implements Comparable<NodeIdPair> {
>    private long            _node;
>    private int             _id;
>    static final NodeIdPair _prototype = new NodeIdPair(Long.MAX_VALUE,
>                                               Integer.MAX_VALUE);
>    static Integer          MY_SIZE    = null;
> 
>    public static int size() {
>        if (MY_SIZE == null) {
>            MY_SIZE = (new NodeIdPair(Long.MAX_VALUE, Integer.MAX_VALUE))
>                    .toByteArray().length;
>            System.out.println("MY_SIZE: " + MY_SIZE);
>        }
>        return MY_SIZE;
>    }
> 
>    public NodeIdPair(long node, int id) {
>        _node = node;
>        _id = id;
>    }
> 
>    public NodeIdPair(byte fromByteArray[]) {
>        ByteArrayInputStream bais = new ByteArrayInputStream(fromByteArray);
>        DataInputStream dis = new DataInputStream(bais);
>        try {
>            _node = dis.readLong();
>            _id = dis.readInt();
>        } catch (Exception e) {
>            throw new Error("Unexpected exception. byte[] len " + 
> fromByteArray.length, e);
>        }
>    }
> 
>    byte[] toByteArray() {
>        ByteArrayOutputStream bos = new ByteArrayOutputStream(
>                MY_SIZE != null ? MY_SIZE : 12);
>        DataOutputStream dos = new DataOutputStream(bos);
>        try {
>            dos.writeLong(_node);
>            dos.writeInt(_id);
>            dos.flush();
>        } catch (Exception e) {
>            throw new Error("Unexpected exception: ", e);
>        }
> 
>        return bos.toByteArray();
>    }
> 
>    @Override
>    public int compareTo(NodeIdPair arg0) {
>        // avoid overflow of (_id - arg0._id) when the ids are far apart
>        return _id < arg0._id ? -1 : (_id == arg0._id ? 0 : 1);
>    }
> 
>    public long getNode() {
>        return _node;
>    }
> 
>    public int getId() {
>        return _id;
>    }
> 
> }
> 
> 
> 
> 
> 
> package com.xxx.neo4j.restore;
> 
> import java.util.Arrays;
> import java.util.TreeSet;
> 
> public class CompactNodeIndex {
>    private int  _offSet = 0;
>    private byte _extent[];
>    private int  _slotCount;
> 
>    public CompactNodeIndex(TreeSet<NodeIdPair> sortedPairs) {
>        _extent = new byte[sortedPairs.size() * NodeIdPair.size()];
>        _slotCount = sortedPairs.size();
>        for (NodeIdPair pair : sortedPairs) {
>            byte pairBytes[] = pair.toByteArray();
>            copyToExtent(pairBytes);
>        }
>        System.out.println("CompactNodeIndex slot count: " + _slotCount);
>    }
> 
>    public NodeIdPair findNodeForId(int id) {
>        return search(id, 0, _slotCount - 1);
>    }
> 
>    @SuppressWarnings("serial")
>    static class FoundIt extends Exception {
>        NodeIdPair _result;
> 
>        FoundIt(NodeIdPair result) {
>            _result = result;
>        }
> 
>    }
> 
>    private NodeIdPair search(int soughtId, int lowerBound, int upperBound) {
>        try {
>            while (true) {
>                if ((upperBound - lowerBound) > 1) {
>                    int compareSlot = lowerBound
>                            + ((upperBound - lowerBound) / 2);
>                    int comparison = compareAt(soughtId, compareSlot);
> 
>                    if (comparison > 0) {
>                        lowerBound = compareSlot;
>                        continue;
>                    } else {
>                        upperBound = compareSlot;
>                    }
>                } else {
>                    compareAt(soughtId, upperBound);
>                    compareAt(soughtId, lowerBound);
>                    // not found it
>                    return null;
>                }
>            }
>        } catch (FoundIt result) {
>            return result._result;
>        }
>    }
> 
>    private int compareAt(int soughtId, int compareSlot) throws FoundIt {
>        NodeIdPair candidate = get(compareSlot);
>        int diff = soughtId - candidate.getId();
>        if (diff == 0)
>            throw new FoundIt(candidate);
>        return diff;
>    }
> 
>    private NodeIdPair get(int compareSlot) {
>        int startPos = compareSlot * NodeIdPair.size();
>        byte serialisedPair[] = Arrays.copyOfRange(_extent, startPos, startPos
>                + NodeIdPair.size());
> 
>        return new NodeIdPair(serialisedPair);
>    }
> 
>    private void copyToExtent(byte[] pairBytes) {
>        for (byte b : pairBytes) {
>            if (_offSet >= _extent.length) // >= : the next write would run past the extent
>                throw new Error("Unexpected extent overflow: " + _offSet);
>            _extent[_offSet++] = b;
>        }
>    }
> }
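> 
> For reference, a rough sketch of how these classes might be wired into the 
> import (illustrative only, not the full test; the BatchInserter, relationship 
> type and integer keys are assumed from the surrounding test code):
> 
> import java.util.TreeSet;
> 
> import org.neo4j.graphdb.RelationshipType;
> import org.neo4j.kernel.impl.batchinsert.BatchInserter;
> 
> public class CompactNodeIndexUsage {
>     // sortedPairs is filled during node creation with new NodeIdPair(nodeId, id)
>     static CompactNodeIndex buildIndex(TreeSet<NodeIdPair> sortedPairs) {
>         return new CompactNodeIndex(sortedPairs); // afterwards the TreeSet can be gc'd
>     }
> 
>     // during relationship creation, resolve both ends and link them
>     static void link(BatchInserter inserter, CompactNodeIndex index,
>                      int fromId, int toId, RelationshipType relType) {
>         NodeIdPair from = index.findNodeForId(fromId);
>         NodeIdPair to = index.findNodeForId(toId);
>         if (from != null && to != null) {
>             inserter.createRelationship(from.getNode(), to.getNode(), relType, null);
>         }
>     }
> }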
> 
> 
> On 13 Jun 2011, at 13:23, Michael Hunger wrote:
> 
>> Paul,
>> 
>> can you share your test and the CompactIndex you wrote?
>> 
>> That would be great. 
>> 
>> Also the memory settings (Xmx) you used for the different runs.
>> 
>> Thanks so much
>> 
>> Michael
>> 
>> On 13.06.2011 at 14:15, Paul Bandler wrote:
>> 
>>> Having noticed a mention in the 1.4M04 release notes that:
>>> 
>>>> Also, the BatchInserterIndex now keeps its memory usage in-check with 
>>>> batched commits of indexed data using a configurable batch commit size.
>>> 
>>> I re-ran this test using M04 and sure enough, node creation no longer eats 
>>> up the heap linearly so that is good - I should be able to remove the 
>>> periodic resetting of the BatchInserter during import.
>>> 
>>> So I returned to the issue of removing the index creation and later access 
>>> bottleneck by using an application-managed data structure as Michael 
>>> illustrated. Needing a solution with a smaller memory footprint, I wrote 
>>> a CompactNodeIndex class for mapping integer 'id' key values to long node 
>>> ids; it keeps the memory footprint to a minimum by overlaying a 
>>> binary-choppable table onto a byte array.  Watching the heap in jconsole 
>>> while this ran, I could see it had the desired effect of releasing huge 
>>> amounts of heap once the CompactNodeIndex is loaded and the source data 
>>> structure gc'd.  However, when I attempted to scale the test program back 
>>> up to the 10M nodes Michael had been testing, it appears to run into 
>>> something of a brick wall, becoming massively I/O bound when creating the 
>>> relationships.  With 1M nodes it ran OK, with 2M nodes not too badly, but 
>>> much beyond that it crawls along using just about 1% of CPU while having 
>>> loads of heap spare.
>>> 
>>> I re-ran on a more generously configured iMac (giving the test 4G of heap) 
>>> and it did much better in that it actually showed some progress building 
>>> relationships over a 10M node-set, but still exhibited massive slow down 
>>> once past 7M relationships.
>>> 
>>> Below are the test results - the question now is: are there any Neo4j 
>>> parameters that might relieve the I/O bottleneck that appears when building 
>>> relationships over node sets of this size with the BatchInserter...?  I note 
>>> the section in the manual on performance parameters, but I'm afraid that, 
>>> not being familiar enough with the Neo4j internals, I don't feel it gives 
>>> enough clear information on how to set them to improve the performance of 
>>> this use-case.
>>> 
>>> Thanks,
>>> 
>>> Paul
>>> 
>>> Run 1 - Windows m/c..REPORT_COUNT = MILLION/10
>>> Physical mem: 1535MB, Heap size: 1016MB
>>> use_memory_mapped_buffers=false
>>> neostore.propertystore.db.index.keys.mapped_memory=1M
>>> neostore.propertystore.db.strings.mapped_memory=52M
>>> neostore.propertystore.db.arrays.mapped_memory=60M
>>> neo_store=N:\TradeModel\target\hepper\neostore
>>> neostore.relationshipstore.db.mapped_memory=76M
>>> neostore.propertystore.db.index.mapped_memory=1M
>>> neostore.propertystore.db.mapped_memory=62M
>>> dump_configuration=true
>>> cache_type=weak
>>> neostore.nodestore.db.mapped_memory=17M
>>> 100000 nodes created. Took 2906
>>> 200000 nodes created. Took 2688
>>> 300000 nodes created. Took 2828
>>> 400000 nodes created. Took 2953
>>> 500000 nodes created. Took 2672
>>> 600000 nodes created. Took 2766
>>> 700000 nodes created. Took 2687
>>> 800000 nodes created. Took 2703
>>> 900000 nodes created. Took 2719
>>> 1000000 nodes created. Took 2641
>>> Creating nodes took 27
>>> MY_SIZE: 12
>>> CompactNodeIndex slot count: 1000000
>>> 100000 relationships created. Took 4125
>>> 200000 relationships created. Took 3953
>>> 300000 relationships created. Took 3937
>>> 400000 relationships created. Took 3610
>>> 500000 relationships created. Took 3719
>>> 600000 relationships created. Took 4328
>>> 700000 relationships created. Took 3750
>>> 800000 relationships created. Took 3609
>>> 900000 relationships created. Took 4125
>>> 1000000 relationships created. Took 3781
>>> 1100000 relationships created. Took 4125
>>> 1200000 relationships created. Took 3750
>>> 1300000 relationships created. Took 3907
>>> 1400000 relationships created. Took 4297
>>> 1500000 relationships created. Took 3703
>>> 1600000 relationships created. Took 3687
>>> 1700000 relationships created. Took 4328
>>> 1800000 relationships created. Took 3907
>>> 1900000 relationships created. Took 3718
>>> 2000000 relationships created. Took 3891
>>> Creating relationships took 78
>>> 
>>> 2M Nodes on Windows m/c:-
>>> 
>>> Creating data took 68 seconds
>>> Physical mem: 1535MB, Heap size: 1016MB
>>> use_memory_mapped_buffers=false
>>> neostore.propertystore.db.index.keys.mapped_memory=1M
>>> neostore.propertystore.db.strings.mapped_memory=52M
>>> neostore.propertystore.db.arrays.mapped_memory=60M
>>> neo_store=N:\TradeModel\target\hepper\neostore
>>> neostore.relationshipstore.db.mapped_memory=76M
>>> neostore.propertystore.db.index.mapped_memory=1M
>>> neostore.propertystore.db.mapped_memory=62M
>>> dump_configuration=true
>>> cache_type=weak
>>> neostore.nodestore.db.mapped_memory=17M
>>> 100000 nodes created. Took 3188
>>> 200000 nodes created. Took 3094
>>> 300000 nodes created. Took 3062
>>> 400000 nodes created. Took 2813
>>> 500000 nodes created. Took 2718
>>> 600000 nodes created. Took 3000
>>> 700000 nodes created. Took 2938
>>> 800000 nodes created. Took 2828
>>> 900000 nodes created. Took 4172
>>> 1000000 nodes created. Took 2859
>>> 1100000 nodes created. Took 3625
>>> 1200000 nodes created. Took 3235
>>> 1300000 nodes created. Took 2781
>>> 1400000 nodes created. Took 2891
>>> 1500000 nodes created. Took 2922
>>> 1600000 nodes created. Took 2968
>>> 1700000 nodes created. Took 3438
>>> 1800000 nodes created. Took 2687
>>> 1900000 nodes created. Took 2969
>>> 2000000 nodes created. Took 2891
>>> Creating nodes took 61
>>> MY_SIZE: 12
>>> CompactNodeIndex slot count: 2000000
>>> 100000 relationships created. Took 311377
>>> 200000 relationships created. Took 11297
>>> 300000 relationships created. Took 11062
>>> 400000 relationships created. Took 10891
>>> 500000 relationships created. Took 11109
>>> 600000 relationships created. Took 11375
>>> 700000 relationships created. Took 11266
>>> 800000 relationships created. Took 26469
>>> 900000 relationships created. Took 46875
>>> 1000000 relationships created. Took 12047
>>> 1100000 relationships created. Took 43016
>>> 1200000 relationships created. Took 12110
>>> 1300000 relationships created. Took 12625
>>> 1400000 relationships created. Took 12031
>>> 1500000 relationships created. Took 40375
>>> 1600000 relationships created. Took 11328
>>> 1700000 relationships created. Took 11125
>>> 1800000 relationships created. Took 10891
>>> 1900000 relationships created. Took 11266
>>> 2000000 relationships created. Took 11125
>>> 2100000 relationships created. Took 11281
>>> 2200000 relationships created. Took 11156
>>> 2300000 relationships created. Took 11250
>>> 2400000 relationships created. Took 11735
>>> 2500000 relationships created. Took 15984
>>> 2600000 relationships created. Took 16766
>>> 2700000 relationships created. Took 71969
>>> 2800000 relationships created. Took 205283
>>> 2900000 relationships created. Took 159236
>>> 3000000 relationships created. Took 32734
>>> 3100000 relationships created. Took 149064
>>> 3200000 relationships created. Took 116391
>>> 3300000 relationships created. Took 74079
>>> 3400000 relationships created. Took 43360
>>> 3500000 relationships created. Took 20500
>>> 3600000 relationships created. Took 246704
>>> 3700000 relationships created. Took 74407
>>> 3800000 relationships created. Took 189611
>>> 3900000 relationships created. Took 44922
>>> 4000000 relationships created. Took 482675
>>> Creating relationships took 2628
>>> 
>>> iMac (REPORT_COUNT = MILLION)
>>> Physical mem: 4096MB, Heap size: 2039MB
>>> use_memory_mapped_buffers=false
>>> neostore.propertystore.db.index.keys.mapped_memory=1M
>>> neostore.propertystore.db.strings.mapped_memory=106M
>>> neostore.propertystore.db.arrays.mapped_memory=120M
>>> neo_store=/Users/paulbandler/Documents/workspace/Neo4jImport/target/hepper/neostore
>>> neostore.relationshipstore.db.mapped_memory=152M
>>> neostore.propertystore.db.index.mapped_memory=1M
>>> neostore.propertystore.db.mapped_memory=124M
>>> dump_configuration=true
>>> cache_type=weak
>>> neostore.nodestore.db.mapped_memory=34M
>>> 1000000 nodes created. Took 2817 
>>> 2000000 nodes created. Took 2407 
>>> 3000000 nodes created. Took 2086 
>>> 4000000 nodes created. Took 2303 
>>> 5000000 nodes created. Took 2912 
>>> 6000000 nodes created. Took 2178 
>>> 7000000 nodes created. Took 2241 
>>> 8000000 nodes created. Took 2453 
>>> 9000000 nodes created. Took 2627 
>>> 10000000 nodes created. Took 3996 
>>> Creating nodes took 26
>>> MY_SIZE: 12
>>> CompactNodeIndex slot count: 10000000
>>> 1000000 relationships created. Took 198784 
>>> 2000000 relationships created. Took 24203 
>>> 3000000 relationships created. Took 25313 
>>> 4000000 relationships created. Took 22177 
>>> 5000000 relationships created. Took 22406 
>>> 6000000 relationships created. Took 84977 
>>> 7000000 relationships created. Took 402123 
>>> 8000000 relationships created. Took 1342290 
>>> 
>>> 
>>> On 10 Jun 2011, at 08:27, Michael Hunger wrote:
>>> 
>>>> You're right, the Lucene-based import shouldn't fail for memory problems; I 
>>>> will look into that.
>>>> 
>>>> My suggestion is valid if you want to use an in-memory map to speed up the 
>>>> import. And if you're able to analyze / partition your data, that 
>>>> might be a viable solution.
>>>> 
>>>> Will get back to you with the findings later.
>>>> 
>>>> Michael
>>>> 
>>>> On 10.06.2011 at 09:02, Paul Bandler wrote:
>>>> 
>>>>> 
>>>>> On 9 Jun 2011, at 22:12, Michael Hunger wrote:
>>>>> 
>>>>>> Please keep in mind that the HashMap of 10M strings -> longs will take a 
>>>>>> substantial amount of heap memory.
>>>>>> That's not the fault of Neo4j :) On my system it alone takes 1.8 G of 
>>>>>> memory (distributed across the strings, the hashmap-entries and the 
>>>>>> longs).
>>>>> 
>>>>> 
>>>>> Fair enough, but removing the Map, using the Index instead, and 
>>>>> setting the cache_type to weak makes almost no difference to the program's 
>>>>> behaviour in terms of progressively consuming the heap until it fails.  I 
>>>>> did this, including removing the allocation of the Map, and watched the 
>>>>> heap consumption follow a similar pattern until it failed, as below.
>>>>> 
>>>>>> Or you should perhaps use an amazon ec2 instance which you can easily 
>>>>>> get with up to 68 G of RAM :)
>>>>> 
>>>>> With respect, and while I notice the smile, throwing memory at it is not 
>>>>> an option for a large set of enterprise applications that might actually 
>>>>> be willing to pay to use Neo4j if it didn't fail at the first hurdle when 
>>>>> confronted with a trivial and small scale data load...
>>>>> 
>>>>> runImport failed after 2,072 seconds....
>>>>> 
>>>>> Creating data took 316 seconds
>>>>> Physical mem: 1535MB, Heap size: 1016MB
>>>>> use_memory_mapped_buffers=false
>>>>> neostore.propertystore.db.index.keys.mapped_memory=1M
>>>>> neostore.propertystore.db.strings.mapped_memory=52M
>>>>> neostore.propertystore.db.arrays.mapped_memory=60M
>>>>> neo_store=N:\TradeModel\target\hepper\neostore
>>>>> neostore.relationshipstore.db.mapped_memory=76M
>>>>> neostore.propertystore.db.index.mapped_memory=1M
>>>>> neostore.propertystore.db.mapped_memory=62M
>>>>> dump_configuration=true
>>>>> cache_type=weak
>>>>> neostore.nodestore.db.mapped_memory=17M
>>>>> 1000000 nodes created. Took 59906
>>>>> 2000000 nodes created. Took 64546
>>>>> 3000000 nodes created. Took 74577
>>>>> 4000000 nodes created. Took 82607
>>>>> 5000000 nodes created. Took 171091
>>>>> Exception in thread "RMI TCP Connection(idle)" 
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>    at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>>    at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>>    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown 
>>>>> Source)
>>>>>    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown 
>>>>> Source)
>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown 
>>>>> Source)
>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>>    at java.lang.Thread.run(Unknown Source)
>>>>> Exception in thread "RMI TCP Connection(idle)" 
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>    at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>>    at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>>    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown 
>>>>> Source)
>>>>>    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown 
>>>>> Source)
>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown 
>>>>> Source)
>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>>    at java.lang.Thread.run(Unknown Source)
>>>>> Exception in thread "RMI TCP Connection(idle)" 
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>    at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>>    at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>>>    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown 
>>>>> Source)
>>>>>    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown 
>>>>> Source)
>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown 
>>>>> Source)
>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>>    at java.lang.Thread.run(Unknown Source)
>>>>> Exception in thread "RMI TCP Connection(idle)" 
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>    at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>>    at java.io.BufferedInputStream.<init>(Unknown Source)
>>>>>    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown 
>>>>> Source)
>>>>>    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown 
>>>>> Source)
>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown 
>>>>> Source)
>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>>    at java.lang.Thread.run(Unknown Source)
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> So 3 GB of heap is sensible to run this; that leaves about 1 GB for Neo4j 
>>>>>> + its caches.
>>>>>> 
>>>>>> Of course you're free to shard your map (e.g. by first letter of the 
>>>>>> name) and persist those maps to disk, reloading them if needed. But 
>>>>>> that's an application-level concern.
>>>>>> If you are really limited that way with respect to memory, you should try 
>>>>>> Chris Gioran's implementation, which will take care of that. Or you should 
>>>>>> perhaps use an Amazon EC2 instance, which you can easily get with up to 
>>>>>> 68 G of RAM :)
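>>>>>> 
>>>>>> A rough sketch of what such a sharded map could look like (illustrative 
>>>>>> only; spilling shards to disk and reloading them is left out):
>>>>>> 
>>>>>> import java.util.HashMap;
>>>>>> import java.util.Map;
>>>>>> 
>>>>>> // name -> nodeId lookup, sharded by the first character of the name
>>>>>> class ShardedNodeMap {
>>>>>>     private final Map<Character, Map<String, Long>> shards =
>>>>>>             new HashMap<Character, Map<String, Long>>();
>>>>>> 
>>>>>>     void put(String name, long nodeId) {
>>>>>>         Map<String, Long> shard = shards.get(name.charAt(0));
>>>>>>         if (shard == null) {
>>>>>>             shard = new HashMap<String, Long>();
>>>>>>             shards.put(name.charAt(0), shard);
>>>>>>         }
>>>>>>         shard.put(name, nodeId);
>>>>>>     }
>>>>>> 
>>>>>>     Long get(String name) {
>>>>>>         Map<String, Long> shard = shards.get(name.charAt(0));
>>>>>>         return shard == null ? null : shard.get(name);
>>>>>>     }
>>>>>> }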
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> Michael
>>>>>> 
>>>>>> 
>>>>>> P.S. As a side-note:
>>>>>> For the rest of the memory:
>>>>>> Have you tried using the weak reference cache instead of the default soft 
>>>>>> one?
>>>>>> In your config.properties add
>>>>>> cache_type = weak
>>>>>> That should take care of your memory problems (and the stopping, which is 
>>>>>> actually the GC trying to reclaim memory).
>>>>>> 
>>>>>> On 09.06.2011 at 22:36, Paul Bandler wrote:
>>>>>> 
>>>>>>> I ran Michael’s example test import program with the Map replacing the 
>>>>>>> index on my more modestly configured machine, to see whether the 
>>>>>>> import scaling problems I have reported previously using BatchInserter 
>>>>>>> were reproduced.  They were – I gave the program 1G of heap and watched 
>>>>>>> it run using jconsole.  It ran reasonably quickly, consuming the heap in 
>>>>>>> an almost straight line until it neared its capacity, then practically 
>>>>>>> stopped for about 20 minutes, after which it died with an out of memory 
>>>>>>> error – see below.
>>>>>>> 
>>>>>>> Now I’m not saying that Neo4j should necessarily go out of its way to 
>>>>>>> support very memory-constrained environments, but I do think it is 
>>>>>>> not unreasonable to expect its batch import mechanism not to fall over 
>>>>>>> in this way, but rather to flush its buffers or whatever, without 
>>>>>>> requiring the import application writer to shut it down and restart it 
>>>>>>> periodically...
>>>>>>> 
>>>>>>> Creating data took 331 seconds
>>>>>>> 1000000 nodes created. Took 29001
>>>>>>> 2000000 nodes created. Took 35107
>>>>>>> 3000000 nodes created. Took 35904
>>>>>>> 4000000 nodes created. Took 66169
>>>>>>> 5000000 nodes created. Took 63280
>>>>>>> 6000000 nodes created. Took 183922
>>>>>>> 7000000 nodes created. Took 258276
>>>>>>> 
>>>>>>> com.nomura.smo.rdm.neo4j.restore.Hepper
>>>>>>> createData(330.364seconds)
>>>>>>> runImport (1,485 seconds later...)
>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>  at java.util.ArrayList.<init>(Unknown Source)
>>>>>>>  at java.util.ArrayList.<init>(Unknown Source)
>>>>>>>  at 
>>>>>>> org.neo4j.kernel.impl.nioneo.store.PropertyRecord.<init>(PropertyRecord.java:33)
>>>>>>>  at 
>>>>>>> org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createPropertyChain(BatchInserterImpl.java:425)
>>>>>>>  at 
>>>>>>> org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createNode(BatchInserterImpl.java:143)
>>>>>>>  at com.nomura.smo.rdm.neo4j.restore.Hepper.runImport(Hepper.java:61)
>>>>>>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>>>>>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>>>>>  at java.lang.reflect.Method.invoke(Unknown Source)
>>>>>>>  at 
>>>>>>> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>>>>>>>  at 
>>>>>>> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>>>>>>>  at 
>>>>>>> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>>>>>>>  at 
>>>>>>> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>>>>>>>  at 
>>>>>>> org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
>>>>>>>  at 
>>>>>>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
>>>>>>>  at 
>>>>>>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
>>>>>>>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>>>>>>>  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>>>>>>>  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>>>>>>>  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>>>>>>>  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>>>>>>>  at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>>>>>>>  at 
>>>>>>> org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
>>>>>>>  at 
>>>>>>> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>>>>  at 
>>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>>>>>>>  at 
>>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>>>>>>>  at 
>>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>>>>>>>  at 
>>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
>>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Paul Bandler 
>>>>>>> On 9 Jun 2011, at 12:27, Michael Hunger wrote:
>>>>>>> 
>>>>>>>> I recreated Daniel's code in Java, mainly because some things were 
>>>>>>>> missing from his Scala example.
>>>>>>>> 
>>>>>>>> You're right that the index is the bottleneck. But with your small 
>>>>>>>> data set it should be possible to cache the 10m nodes in a heap that 
>>>>>>>> fits in your machine.
>>>>>>>> 
>>>>>>>> I ran it first with the index and had about 8 seconds / 1M nodes and 
>>>>>>>> 320 sec/1M rels.
>>>>>>>> 
>>>>>>>> Then I switched to a 3G heap and a HashMap to keep the name=>node lookup, 
>>>>>>>> and it went to 2s/1M nodes and from 13 down to 3 sec for 1M rels.
>>>>>>>> 
>>>>>>>> That is the approach that Chris takes, only his solution can 
>>>>>>>> persist the map to disk and is more efficient :)
>>>>>>>> 
>>>>>>>> Hope that helps.
>>>>>>>> 
>>>>>>>> Michael
>>>>>>>> 
>>>>>>>> package org.neo4j.load;
>>>>>>>> 
>>>>>>>> import org.apache.commons.io.FileUtils;
>>>>>>>> import org.junit.Test;
>>>>>>>> import org.neo4j.graphdb.RelationshipType;
>>>>>>>> import org.neo4j.graphdb.index.BatchInserterIndex;
>>>>>>>> import org.neo4j.graphdb.index.BatchInserterIndexProvider;
>>>>>>>> import org.neo4j.helpers.collection.MapUtil;
>>>>>>>> import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
>>>>>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserter;
>>>>>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
>>>>>>>> 
>>>>>>>> import java.io.*;
>>>>>>>> import java.util.HashMap;
>>>>>>>> import java.util.Map;
>>>>>>>> import java.util.Random;
>>>>>>>> 
>>>>>>>> /**
>>>>>>>> * @author mh
>>>>>>>> * @since 09.06.11
>>>>>>>> */
>>>>>>>> public class Hepper {
>>>>>>>> 
>>>>>>>> public static final int REPORT_COUNT = Config.MILLION;
>>>>>>>> 
>>>>>>>> enum MyRelationshipTypes implements RelationshipType {
>>>>>>>> BELONGS_TO
>>>>>>>> }
>>>>>>>> 
>>>>>>>> public static final int COUNT = Config.MILLION * 10;
>>>>>>>> 
>>>>>>>> @Test
>>>>>>>> public void createData() throws IOException {
>>>>>>>> long time = System.currentTimeMillis();
>>>>>>>> final PrintWriter writer = new PrintWriter(new BufferedWriter(new 
>>>>>>>> FileWriter("data.txt")));
>>>>>>>> Random r = new Random(-1L);
>>>>>>>> for (int nodes = 0; nodes < COUNT; nodes++) {
>>>>>>>>     writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT), 
>>>>>>>> r.nextInt(COUNT));
>>>>>>>> }
>>>>>>>> writer.close();
>>>>>>>> System.out.println("Creating data took "+ (System.currentTimeMillis() 
>>>>>>>> - time) / 1000 +" seconds");
>>>>>>>> }
>>>>>>>> 
>>>>>>>> @Test
>>>>>>>> public void runImport() throws IOException {
>>>>>>>> Map<String,Long> cache=new HashMap<String, Long>(COUNT);
>>>>>>>> final File storeDir = new File("target/hepper");
>>>>>>>> FileUtils.deleteDirectory(storeDir);
>>>>>>>> BatchInserter inserter = new 
>>>>>>>> BatchInserterImpl(storeDir.getAbsolutePath());
>>>>>>>> final BatchInserterIndexProvider indexProvider = new 
>>>>>>>> LuceneBatchInserterIndexProvider(inserter);
>>>>>>>> final BatchInserterIndex index = indexProvider.nodeIndex("pages", 
>>>>>>>> MapUtil.stringMap("type", "exact"));
>>>>>>>> BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
>>>>>>>> String line = null;
>>>>>>>> int nodes = 0;
>>>>>>>> long time = System.currentTimeMillis();
>>>>>>>> long batchTime=time;
>>>>>>>> while ((line = reader.readLine()) != null) {
>>>>>>>>     final String[] nodeNames = line.split("\\|");
>>>>>>>>     final String name = nodeNames[0];
>>>>>>>>     final Map<String, Object> props = MapUtil.map("name", name);
>>>>>>>>     final long node = inserter.createNode(props);
>>>>>>>>     //index.add(node, props);
>>>>>>>>     cache.put(name,node);
>>>>>>>>     nodes++;
>>>>>>>>     if ((nodes % REPORT_COUNT) == 0) {
>>>>>>>>         System.out.printf("%d nodes created. Took %d %n", nodes, 
>>>>>>>> (System.currentTimeMillis() - batchTime));
>>>>>>>>         batchTime = System.currentTimeMillis();
>>>>>>>>     }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> System.out.println("Creating nodes took "+ (System.currentTimeMillis() 
>>>>>>>> - time) / 1000);
>>>>>>>> index.flush();
>>>>>>>> reader.close();
>>>>>>>> reader = new BufferedReader(new FileReader("data.txt"));
>>>>>>>> int rels = 0;
>>>>>>>> time = System.currentTimeMillis();
>>>>>>>> batchTime=time;
>>>>>>>> while ((line = reader.readLine()) != null) {
>>>>>>>>     final String[] nodeNames = line.split("\\|");
>>>>>>>>     final String name = nodeNames[0];
>>>>>>>>     //final Long from = index.get("name", name).getSingle();
>>>>>>>>     Long from =cache.get(name);
>>>>>>>>     for (int j = 1; j < nodeNames.length; j++) {
>>>>>>>>         //final Long to = index.get("name", nodeNames[j]).getSingle();
>>>>>>>>         final Long to = cache.get(nodeNames[j]); // look up the target node, not 'name' again
>>>>>>>>         inserter.createRelationship(from, to, 
>>>>>>>> MyRelationshipTypes.BELONGS_TO,null);
>>>>>>>>     }
>>>>>>>>     rels++;
>>>>>>>>     if ((rels % REPORT_COUNT) == 0) {
>>>>>>>>         System.out.printf("%d relationships created. Took %d %n", 
>>>>>>>> rels, (System.currentTimeMillis() - batchTime));
>>>>>>>>         batchTime = System.currentTimeMillis();
>>>>>>>>     }
>>>>>>>> }
>>>>>>>> System.out.println("Creating relationships took "+ 
>>>>>>>> (System.currentTimeMillis() - time) / 1000);
>>>>>>>> }
>>>>>>>> }
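>>>>>>>> 
>>>>>>>> One note on the code above: with the BatchInserter the data is only 
>>>>>>>> guaranteed to be flushed to the store files on shutdown, so the import 
>>>>>>>> should end with something like:
>>>>>>>> 
>>>>>>>> indexProvider.shutdown();
>>>>>>>> inserter.shutdown();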
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 1000000 nodes created. Took 2227 
>>>>>>>> 2000000 nodes created. Took 1930 
>>>>>>>> 3000000 nodes created. Took 1818 
>>>>>>>> 4000000 nodes created. Took 1966 
>>>>>>>> 5000000 nodes created. Took 1857 
>>>>>>>> 6000000 nodes created. Took 2009 
>>>>>>>> 7000000 nodes created. Took 2068 
>>>>>>>> 8000000 nodes created. Took 1991 
>>>>>>>> 9000000 nodes created. Took 2151 
>>>>>>>> 10000000 nodes created. Took 2276 
>>>>>>>> Creating nodes took 20
>>>>>>>> 1000000 relationships created. Took 13441 
>>>>>>>> 2000000 relationships created. Took 12887 
>>>>>>>> 3000000 relationships created. Took 12922 
>>>>>>>> 4000000 relationships created. Took 13149 
>>>>>>>> 5000000 relationships created. Took 14177 
>>>>>>>> 6000000 relationships created. Took 3377 
>>>>>>>> 7000000 relationships created. Took 2932 
>>>>>>>> 8000000 relationships created. Took 2991 
>>>>>>>> 9000000 relationships created. Took 2992 
>>>>>>>> 10000000 relationships created. Took 2912 
>>>>>>>> Creating relationships took 81
>>>>>>>> 
>>>>>>>> On 09.06.2011 at 12:51, Chris Gioran wrote:
>>>>>>>> 
>>>>>>>>> Hi Daniel,
>>>>>>>>> 
>>>>>>>>> I am currently working on a tool for importing big data sets into 
>>>>>>>>> Neo4j graphs.
>>>>>>>>> The main problem in such operations is that the usual index
>>>>>>>>> implementations are just too slow for retrieving the mapping from keys
>>>>>>>>> to created node ids, so a custom solution is needed, one that depends
>>>>>>>>> to a varying degree on the distribution of values in the input set.
>>>>>>>>> 
>>>>>>>>> While your dataset is smaller than the data sizes I deal with, I would
>>>>>>>>> like to use it as a test case. If you could somehow provide the actual
>>>>>>>>> data or something that emulates it, I would be grateful.
>>>>>>>>> 
>>>>>>>>> If you want to see my approach, it is available here
>>>>>>>>> 
>>>>>>>>> https://github.com/digitalstain/BigDataImport
>>>>>>>>> 
>>>>>>>>> The core algorithm is an XJoin-style two-level hashing scheme with
>>>>>>>>> adaptable eviction strategies, but it is not production-ready yet,
>>>>>>>>> mainly from an API perspective.
>>>>>>>>> 
>>>>>>>>> You can contact me directly for any details regarding this issue.
>>>>>>>>> 
>>>>>>>>> cheers,
>>>>>>>>> CG
>>>>>>>>> 
>>>>>>>>> On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper 
>>>>>>>>> <daniel.hep...@gmail.com> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> I'm struggling with importing a graph with about 10m nodes and 20m
>>>>>>>>>> relationships, with nodes having 0 to 10 relationships. Creating the
>>>>>>>>>> nodes takes about 10 minutes, but creating the relationships is 
>>>>>>>>>> slower
>>>>>>>>>> by several orders of magnitude. I'm using a 2.4 GHz i7 MacBookPro 
>>>>>>>>>> with
>>>>>>>>>> 4GB RAM and conventional HDD.
>>>>>>>>>> 
>>>>>>>>>> The graph is stored as adjacency list in a text file where each line
>>>>>>>>>> has this form:
>>>>>>>>>> 
>>>>>>>>>> Foo|Bar|Baz
>>>>>>>>>> (Node Foo has relations to Bar and Baz)
>>>>>>>>>> 
>>>>>>>>>> My current approach is to iterate over the whole file twice. In the
>>>>>>>>>> first run, I create a node with the property "name" for the first
>>>>>>>>>> entry in the line (Foo in this case) and add it to an index.
>>>>>>>>>> In the second run, I get the start node and the end nodes from the
>>>>>>>>>> index by name and create the relationships.
>>>>>>>>>> 
>>>>>>>>>> My code can be found here: http://pastie.org/2041801
>>>>>>>>>> 
>>>>>>>>>> With my approach, the best I can achieve is 100 created relationships
>>>>>>>>>> per second.
>>>>>>>>>> I experimented with mapped memory settings, but without much effect.
>>>>>>>>>> Is this the speed I can expect?
>>>>>>>>>> Any advice on how to speed up this process?
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> Daniel Hepper

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
