It totally depends on the input distribution. One very simple thing you can do:

- Define a VertexResolver that, on every vertex creation, sets id = domain of the URL and value = the set of URLs in that domain; it keeps appending as more vertices with the same id (i.e., the same domain) are read from the input.
- Now you can ignore edges altogether. All you are left with are these huge vertices, identified by domain, whose value is a set of URLs.
- Here you can use the aggregator approach: each vertex sends (domain, size of set) to the master, and the master combines these into something like [(domain1, offset1), (domain2, offset2), ...]. All the huge vertices read this aggregator and figure out their offset, and at output time each URL in the set is written out with number = offset + its position in the set.

So now you have a url -> number map. It is highly unstable, though: adding one more URL to a domain later changes the ordering completely. That is when you can switch to id = domain + insert date, etc. (This stops working at some point anyway, because the aggregator has to carry huge messages, the offset computation has to be spread over multiple supersteps, and so on.)

Now that you have the url -> number map, all that is left is a join, and that's simple: read your original table plus this map table in a single Giraph job, and you can use two supersteps to rename all the vertices properly.
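The offset bookkeeping behind the aggregator approach above can be sketched in plain Java. This is not Giraph API code, just a single-machine illustration of the arithmetic: a running offset per domain (playing the role of the master-side aggregator) and each URL numbered as offset + its position in the domain's set. The class and method names are made up for the sketch.

```java
import java.util.*;

public class DomainOffsetSketch {

    // Assign dense ids: running offset per domain + position within the set.
    static Map<String, Long> assignIds(Map<String, List<String>> urlsByDomain) {
        Map<String, Long> idByUrl = new LinkedHashMap<>();
        long offset = 0; // plays the role of the combined master-side aggregator
        for (Map.Entry<String, List<String>> e : urlsByDomain.entrySet()) {
            List<String> urls = e.getValue();
            for (int i = 0; i < urls.size(); i++) {
                idByUrl.put(urls.get(i), offset + i); // offset + number in set
            }
            offset += urls.size();
        }
        return idByUrl;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byDomain = new LinkedHashMap<>();
        byDomain.put("a.com", Arrays.asList("a.com/x", "a.com/y"));
        byDomain.put("b.org", Arrays.asList("b.org/z"));
        // Dense ids with no holes, stable only as long as the sets don't change.
        System.out.println(assignIds(byDomain));
    }
}
```

Note the instability mentioned above is visible here: inserting one URL into a.com would shift every id in b.org.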
(Note that you can do another thing here as well, MUCH simpler than the above:

- Superstep 0: in your compute class, keep a thread-local counter that increases for each vertex the thread computes, and assign each vertex the value [(workerId, threadId), number]. Then aggregate {(workerId, threadId), count}.
- Superstep 1: the master now sees [{(workerId, threadId), count in group}], so it computes another aggregator of the form [{(workerId, threadId), cumulative sum up to now}] and sends it back to the workers. Each worker reads its cumulative sum from the aggregator and adds it to each vertex's current value. When you output the graph this time as edge output, the source and target ids are set to the vertex values, i.e. the counts.)

Date: Tue, 15 Apr 2014 23:40:39 +0200
Subject: Re: Changing index of a graph
From: mneum...@spotify.com
To: user@giraph.apache.org

I have a pipeline that creates a graph and then does some transformations on it (with Giraph). In the end I want to dump it into Neo4j to allow for Cypher queries. I was told that I could make the batch import into Neo4j a lot faster if I used Long identifiers without holes, matching their internal ID space. If I understand it right, they use the IDs as offsets to build an on-disk index, which is why there should be no holes.

I didn't expect it to be so costly to change the index, but I guess this way I can at least spread the load across the cluster, since the batch import happens on a single machine. Thanks for the input, I will see what makes the most sense with the limited time I have.

On Tue, Apr 15, 2014 at 5:31 PM, Lukas Nalezenec <lukas.naleze...@firma.seznam.cz> wrote:

Hi,

I did the same thing in two M/R jobs during preprocessing - it was pretty powerful for web graphs, but a little slow. A solution for Giraph is:

1. Implement your own Partition that iterates vertices in order, and use an appropriate partitioner.
2. During the first iteration, rename the vertices in each partition without holes.
Holes will then exist only between partitions. At the end, get the min and max vertex index of each partition, send them to the master in an aggregator, and compute the mapping required to delete the holes.
3. During the second iteration, iterate over all vertices and delete the holes by shifting the vertex indexes.
4. .... rename the edges (two more iterations)...

Btw: why do you need such indexes? For HLL?

Lukas

On 15.4.2014 15:33, Martin Neumann wrote:

Hej,

I have a huge edge list (several billion edges) where the node IDs are URLs. The algorithm I want to run needs the IDs to be longs, and there should be no holes in the ID space (so I can't simply hash the URLs). Is anyone aware of a simple solution that does not require an impractically huge hash map?

My current idea is to load the graph into another Giraph job and assign a number to each node. That way the mapping of number to URL would be stored in the node. The problem is that I have to assign the numbers sequentially to ensure there are no holes and that the numbers are unique. I have no idea if this is even possible in Giraph.

Any input is welcome.

cheers
Martin
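The per-partition renumbering Lukas describes earlier in the thread (dense ids inside each partition, holes only between partitions, then a master-computed shift) can be sketched in plain Java. Again this is not Giraph code; the stride, class name, and (min, max) representation are assumptions made for the sketch.

```java
import java.util.*;

public class HoleRemovalSketch {

    // Pass 1: partition p numbers its vertices densely in a private range
    // [p*stride, p*stride + size - 1], so holes exist only between partitions.
    // Returns the (min, max) pair each partition would send to the master.
    static long[][] firstPass(int[] partitionSizes, long stride) {
        long[][] minMax = new long[partitionSizes.length][2];
        for (int p = 0; p < partitionSizes.length; p++) {
            long base = p * stride;
            minMax[p][0] = base;                          // min id in partition
            minMax[p][1] = base + partitionSizes[p] - 1;  // max id in partition
        }
        return minMax;
    }

    // Master side: compute, per partition, the amount to subtract from every
    // id so the global id space becomes dense with no holes.
    static long[] computeShifts(long[][] minMax) {
        long[] shift = new long[minMax.length];
        long next = 0; // next unused id in the dense space
        for (int p = 0; p < minMax.length; p++) {
            shift[p] = minMax[p][0] - next;
            next += minMax[p][1] - minMax[p][0] + 1;
        }
        return shift;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 2, 4};                       // vertices per partition
        long[][] mm = firstPass(sizes, 1000);          // stride > any partition size
        long[] shift = computeShifts(mm);
        // Pass 2 (not shown) subtracts shift[p] from every id in partition p,
        // e.g. partition 1's ids 1000..1001 become 3..4.
        System.out.println(Arrays.toString(shift));
    }
}
```

The second iteration then just applies id - shift[p] to every vertex in partition p, and two more iterations do the same for the edge endpoints.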