I have a pipeline that creates a graph and then runs some transformations on it (with Giraph). In the end I want to dump it into Neo4j to allow for Cypher queries.
I was told that I could make the batch import for Neo4j a lot faster by using Long identifiers without holes, so that they match Neo4j's internal ID space. If I understand it right, the IDs are used as offsets into an on-disk index, which is why there should be no holes. I didn't expect it to be so costly to change the index, but I guess this way I could at least spread the load across the cluster, since the batch import itself happens on a single machine.

Thanks for the input, I will see what makes the most sense with the limited time I have.

On Tue, Apr 15, 2014 at 5:31 PM, Lukas Nalezenec <lukas.naleze...@firma.seznam.cz> wrote:

> Hi,
> I did the same thing in two M/R jobs during preprocessing - it was pretty
> powerful for web graphs but a little bit slow.
>
> Solution for Giraph is:
> 1. Implement your own partition which will iterate vertices in order. Use
> an appropriate partitioner.
> 2. During the first iteration you need to rename vertexes in each partition
> without holes. Holes will be only between partitions.
> At the end, get the min and max vertex index for each partition, send it to
> the master in an aggregator and compute the mapping required to delete holes.
> 3. During the second iteration, iterate all vertexes and delete holes by
> shifting vertex indexes.
>
> 4. .... rename edges (two more iterations)...
>
> Btw: Why do you need such indexes? For HLL?
>
> Lukas
>
>
> On 15.4.2014 15:33, Martin Neumann wrote:
>
> Hej,
>
> I have a huge edgelist (several billion edges) where node IDs are URLs.
> The algorithm I want to run needs the IDs to be long and there should be
> no holes in the ID space (so I can't simply hash the URLs).
>
> Is anyone aware of a simple solution that does not require an impractically
> huge hash map?
>
> My idea currently is to load the graph into another Giraph job and then
> assign a number to each node. This way the mapping of number to URL
> would be stored in the node.
> The problem is that I have to assign the numbers in a sequential way to ensure
> there are no holes and the numbers are unique. No idea if this is even possible
> in Giraph.
>
> Any input is welcome
>
> cheers Martin
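
For reference, here is a rough single-process sketch of the renumbering Lukas outlines above, as I understand it. It is plain Python, not Giraph code: the function names, the CRC32-based partitioner and the in-memory dictionary are all assumptions made for illustration. It also simplifies step 2 by having each partition report its vertex count rather than its min and max index; the master-side step then becomes an exclusive prefix sum over those counts.

```python
import zlib
from itertools import accumulate


def partition_of(url, num_partitions):
    # Hypothetical partitioner: a stable hash of the URL, modulo partition count.
    # (Python's built-in hash() is randomized per process, so CRC32 is used.)
    return zlib.crc32(url.encode("utf-8")) % num_partitions


def assign_dense_ids(edges, num_partitions=4):
    # Step 1: group vertices by partition; each partition is iterated in order.
    partitions = [set() for _ in range(num_partitions)]
    for src, dst in edges:
        partitions[partition_of(src, num_partitions)].add(src)
        partitions[partition_of(dst, num_partitions)].add(dst)

    # Step 2: every partition reports how many vertices it holds
    # (the role an aggregator would play in the distributed version).
    counts = [len(p) for p in partitions]

    # Master side: exclusive prefix sum gives each partition its start offset.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Step 3: global ID = partition offset + local rank, so no holes remain.
    ids = {}
    for offset, part in zip(offsets, partitions):
        for local_rank, url in enumerate(sorted(part)):
            ids[url] = offset + local_rank
    return ids


def rename_edges(edges, ids):
    # Step 4: rewrite every edge endpoint with its new dense ID.
    return [(ids[src], ids[dst]) for src, dst in edges]


if __name__ == "__main__":
    sample = [("http://a", "http://b"), ("http://b", "http://c")]
    mapping = assign_dense_ids(sample, num_partitions=2)
    print(rename_edges(sample, mapping))
```

Note that the `ids` dictionary here is exactly the kind of global map that would be impractical for billions of URLs. In the distributed version each partition would only need its own offset and its own vertices, and the new ID would be stored in the vertex itself.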