The most performant way to insert data with the BatchInserter is to
first insert only the nodes from your node file (that should be fast).
After that (or at the same time), find a way to generate the
relationship file with Neo4j IDs rather than being forced to look the
nodes up in indexes during relationship insertion. Those index lookups
take the bulk of the time, so if you could write your node IDs back to a
file, then massage the relationship text file to include node FROM and
TO IDs (e.g. using Perl or Bash or Ruby) and import that one referring
to these directly, that should be much faster.
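
A minimal sketch of that rewriting step, in Java rather than Perl or Bash since the quoted importer below is Java. It assumes each node ID is remembered when the node is created (e.g. the value returned by inserter.createNode(...)) and that relationship lines use the "%;% " separator seen in the quoted split() call. The class and method names are made up for illustration, and keeping all names in a HashMap assumes the 3.5M names fit in the heap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: rewrites name-based relationship lines into ID-based ones.
final class RelationshipIdRewriter {

    // Author name -> node ID, filled while the nodes are being inserted.
    private final Map<String, Long> idsByName = new HashMap<String, Long>();

    // Record one (name, id) pair, e.g. right after the node has been created.
    void remember(String name, long nodeId) {
        idsByName.put(name, nodeId);
    }

    // Turns "Author1%;% Author2%;% timestamp" into "fromId;toId;timestamp".
    // Returns null for malformed lines or unknown authors so the caller can log and skip.
    String rewrite(String line) {
        String[] parts = line.split("%;% ");
        if (parts.length < 3) {
            return null;
        }
        Long from = idsByName.get(parts[0]);
        Long to = idsByName.get(parts[1]);
        if (from == null || to == null) {
            return null;
        }
        return from + ";" + to + ";" + parts[2].trim();
    }

    public static void main(String[] args) {
        RelationshipIdRewriter rewriter = new RelationshipIdRewriter();
        rewriter.remember("Alice", 1L); // IDs here are invented sample values
        rewriter.remember("Bob", 2L);
        System.out.println(rewriter.rewrite("Alice%;% Bob%;% 1316514180")); // prints 1;2;1316514180
    }
}
```

With a file in that form, the relationship pass only needs Long.parseLong on the first two fields before calling inserter.createRelationship, with no index lookup per line.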



/peter neubauer

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975

On Tue, Sep 20, 2011 at 12:23 PM, st3ven wrote:
> Hello Neo4j community,
> I am creating a graph database for a social network.
> To create the graph database I am using the Batch Inserter.
> The Batch Inserter inserts data from 2 files into the graph database.
> Files:
> 1. the first file contains the Nodes I want to create (about 3.5M Nodes)
> The file looks like this:
> Author 1
> Author 2
> Author 2 ...
> 2. the second file contains every Relationship between the Nodes (about 2.5
> billion Relationships)
> This file looks like this:
> Author1; Author2; timestamp
> Author2; Author3; timestamp
> Author1; Author3; timestamp...
> The specifications of my Computer look like this:
> Intel Core i7 3.4 GHz
> 16 GB RAM
> Geforce GT 420 1GB
> 2TB harddrive
> My Code to create the graph database looks like this:
> package wikiOSN;
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
> import java.util.Map;
> import org.neo4j.graphdb.DynamicRelationshipType;
> import org.neo4j.graphdb.index.BatchInserterIndex;
> import org.neo4j.graphdb.index.BatchInserterIndexProvider;
> import org.neo4j.helpers.collection.MapUtil;
> import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
> import org.neo4j.kernel.impl.batchinsert.BatchInserter;
> import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
> public class CreateAndConnectNodes {
>        public static void main(String[] args) throws IOException {
>                BufferedReader bf = new BufferedReader(new FileReader(
>                                "/media/sdg1/Wikipedia/Reduced Files/autoren-der-wikiartikel"));
>                BufferedReader bf2 = new BufferedReader(new FileReader(
>                                "/media/sdg1/Wikipedia/Reduced Files/wikipedia-output"));
>                CreateAndConnectNodes cacn = new CreateAndConnectNodes();
>                cacn.createGraphDatabase(bf, bf2);
>        }
>        private long relationCounter = 0;
>        private void createGraphDatabase(BufferedReader bf, BufferedReader bf2)
>                        throws IOException {
>                BatchInserter inserter = new BatchInserterImpl(
>                                "target/socialNetwork-batchinsert");
>                BatchInserterIndexProvider indexProvider =
>                                new LuceneBatchInserterIndexProvider(inserter);
>                BatchInserterIndex authors = indexProvider.nodeIndex("author",
>                                MapUtil.stringMap("type", "exact"));
>                authors.setCacheCapacity("name", 100000);
>                String zeile;
>                String zeile2;
>                while ((zeile = bf.readLine()) != null) {
>                        Map<String, Object> properties = MapUtil.map("name", zeile);
>                        long node = inserter.createNode(properties);
>                        authors.add(node, properties);
>                }
>                bf.close();
>                System.out.println("Nodes created!");
>                authors.flush();
>                String node = "";
>                long node1 = 0;
>                long node2 = 0;
>                while ((zeile2 = bf2.readLine()) != null) {
>                        if (relationCounter++ % 100000000 == 0) {
>                                System.out.println("Edges already created: "
>                                                + relationCounter);
>                        }
>                        String[] relation = zeile2.split("%;% ");
>                        if (node.isEmpty()) {
>                                node = relation[0];
>                                if (authors.get("name", relation[0]).getSingle() != null) {
>                                        node1 = authors.get("name", relation[0]).getSingle();
>                                } else {
>                                        System.out.println("Autor 1: " + relation[0]);
>                                        break;
>                                }
>                        }
>                        if (!node.equals(relation[0])) {
>                                node = relation[0];
>                                if (authors.get("name", relation[0]).getSingle() != null) {
>                                        node1 = authors.get("name", relation[0]).getSingle();
>                                } else {
>                                        System.out.println("Autor 1: " + relation[0]);
>                                        break;
>                                }
>                        }
>                        if (authors.get("name", relation[1]).getSingle() != null) {
>                                node2 = authors.get("name", relation[1]).getSingle();
>                        } else {
>                                System.out.println("Autor 2: " + relation[1]);
>                                break;
>                        }
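>                        // NOTE: the property key was lost in the archive; "timestamp" is assumed here.
>                        Map<String, Object> properties = MapUtil.map("timestamp",
>                                        Long.parseLong(relation[2].trim()));
>                        inserter.createRelationship(node1, node2,
>                                        DynamicRelationshipType.withName("KNOWS"), properties);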
>                }
>                System.out.println("Edges created!!!");
>                bf2.close();
>                indexProvider.shutdown();
>                inserter.shutdown();
>        }
> }
> I want to know if there is any better way to create such a big database or
> am I doing it correctly?
> Can I maybe optimize the import for traversals I want to do or is this the
> standard sort of import?
> The Java heapsize for the insert here was -Xmx8G.
> After I had created the graph database I wanted to get the node degree of
> every node.
> To get the node degree I created the following code:
> package wikiOSN;
> import java.io.IOException;
> import java.util.Date;
> import java.util.Iterator;
> import java.util.Map;
> import org.neo4j.graphdb.GraphDatabaseService;
> import org.neo4j.graphdb.Node;
> import org.neo4j.graphdb.Relationship;
> import org.neo4j.kernel.EmbeddedGraphDatabase;
> public class NodeDegree {
>        public static void main(String[] args) throws IOException {
>                NodeDegree nd = new NodeDegree();
>                nd.getNodeDegree();
>                System.out.println("NodeDegree calculated!!!");
>                Date date = new Date();
>                date.setTime(System.currentTimeMillis());
>                System.out.println(date);
>        }
>        private GraphDatabaseService db;
>        private int counter;
>        private void getNodeDegree() throws IOException {
>                db = new EmbeddedGraphDatabase("target/socialNetwork-batchinsert");
>                for (Node node : db.getAllNodes()) {
>                        counter = 0;
>                        if (node.getId() > 0) {
>                                for (Relationship rel : node.getRelationships()) {
>                                        counter++;
>                                }
>                                System.out.println(node.getProperty("name").toString() + ": "
>                                                + counter);
>                        }
>                }
>                db.shutdown();
>        }
> }
> The problem here is that after 3 days I only got the node degree for 80,000
> nodes.
> That is a huge amount of time for only 80,000 nodes. What am I doing wrong
> here?
> I also tried to tune my traversal, but it is still very slow.
> How can I optimize that, so that I get the node degree for all 3.5M nodes
> within one day?
> Do I have to change something in the import of the data, or is there a better
> way of getting the node degree?
> Thank you very much for your help!
> Greetings,
> Stephan
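
This is not something proposed in the thread, but one possible way to get the degrees without traversing the store at all is to count them in a single pass over the relationship file. The sketch below assumes the file has already been rewritten to "fromId;toId;timestamp" lines, as the reply at the top suggests, and that node IDs are small enough to index an array (true for about 3.5M nodes). DegreeCounter, add and degreeOf are hypothetical names.

```java
// Hypothetical helper: accumulates node degrees from ID-based relationship lines.
final class DegreeCounter {

    // long rather than int, since 2.5 billion relationships could overflow an int count.
    private final long[] degree;

    DegreeCounter(int nodeCount) {
        degree = new long[nodeCount];
    }

    // Feed one "fromId;toId;timestamp" line; both endpoints gain one.
    // Note: a self-loop is counted twice by this sketch.
    void add(String line) {
        String[] parts = line.split(";");
        int from = Integer.parseInt(parts[0].trim());
        int to = Integer.parseInt(parts[1].trim());
        degree[from]++;
        degree[to]++;
    }

    long degreeOf(int nodeId) {
        return degree[nodeId];
    }

    public static void main(String[] args) {
        DegreeCounter counter = new DegreeCounter(3);
        counter.add("0;1;1316514180"); // sample lines with invented values
        counter.add("1;2;1316514181");
        System.out.println(counter.degreeOf(1)); // prints 2
    }
}
```

In real use, add would be called once per line read from the file with a BufferedReader, and the array printed at the end; memory use is 8 bytes per node regardless of the number of relationships.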