It totally depends on the input distribution. One very simple thing that can be done is:

Define a VertexResolver that, upon every vertex creation, sets its id = the URL's domain and its value = the set of URLs in that domain; it keeps appending as more vertices with the same id (i.e., domain) are read from the input. Now you can ignore edges altogether; all you are left with is these huge vertices, identified by domain, whose values are the sets of URLs.

Here you can use the aggregator approach of sending (domain, size of set) to the master; these aggregators are combined to give something like [(domain1, offset1), (domain2, offset2), ...]. All the (huge) vertices read this aggregator and figure out their offset, and at output time each vertex in a set is written with number = offset + position in set. A sketch of this offset bookkeeping follows below.

So you have a map now, though it is highly unstable, because adding one more URL to a domain later changes the order completely; that is when you can use id = domain + insert date, etc. (This stops working at some point, because the aggregator has to carry huge messages, and the offset computation via aggregators then needs to be spread over multiple supersteps.)
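
Roughly, the bookkeeping could look like this (a sketch only, assuming Giraph's BasicAggregator and DefaultMasterCompute APIs; the class names and the aggregator names "domain.counts" / "domain.offsets" are made up, and the two classes would live in separate files):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

import org.apache.giraph.aggregators.BasicAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Writable;

/** Merges the (domain -> count) maps sent by workers by summing the counts. */
public class DomainCountAggregator extends BasicAggregator<MapWritable> {
  @Override
  public void aggregate(MapWritable other) {
    MapWritable mine = getAggregatedValue();
    for (Map.Entry<Writable, Writable> e : other.entrySet()) {
      LongWritable prev = (LongWritable) mine.get(e.getKey());
      long sum = (prev == null ? 0L : prev.get())
          + ((LongWritable) e.getValue()).get();
      mine.put(e.getKey(), new LongWritable(sum));
    }
  }

  @Override
  public MapWritable createInitialValue() {
    return new MapWritable();
  }
}

/** After superstep 0, turns (domain -> count) into (domain -> offset). */
public class OffsetMasterCompute extends DefaultMasterCompute {
  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator("domain.counts", DomainCountAggregator.class);
    // Persistent, so workers can still read the offsets in later supersteps.
    registerPersistentAggregator("domain.offsets", DomainCountAggregator.class);
  }

  @Override
  public void compute() {
    if (getSuperstep() == 1) {
      MapWritable counts = getAggregatedValue("domain.counts");
      List<Writable> domains = new ArrayList<>(counts.keySet());
      domains.sort(Comparator.comparing(Object::toString)); // deterministic order
      MapWritable offsets = new MapWritable();
      long running = 0;
      for (Writable domain : domains) {
        offsets.put(domain, new LongWritable(running));
        running += ((LongWritable) counts.get(domain)).get();
      }
      setAggregatedValue("domain.offsets", offsets);
    }
  }
}

Each huge vertex would then call aggregate("domain.counts", singleEntryMap) in superstep 0 and read getAggregatedValue("domain.offsets") in superstep 1.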
Anyway, now that you have the url -> number map, all you have to do is a join, and that is simple: read your original table plus this map table in a single Giraph job, and two supersteps suffice to rename all the vertices properly.
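
A rough sketch of those two supersteps, under loudly-stated assumptions: the (url -> number) map comes in through the vertex input (so each vertex's value starts out as its new long id, kept as text), the original edge list comes in through the edge input, edges are present in both directions, and, since the Java type of a vertex id cannot change mid-job, the renamed adjacency is stashed in the vertex value for a custom VertexOutputFormat (not shown) to write out:

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class RenameJoin extends BasicComputation<Text, Text, NullWritable, Text> {
  @Override
  public void compute(Vertex<Text, Text, NullWritable> vertex,
      Iterable<Text> messages) {
    if (getSuperstep() == 0) {
      // Tell every neighbour which long id this URL maps to.
      sendMessageToAllEdges(vertex,
          new Text(vertex.getId() + "|" + vertex.getValue()));
    } else {
      // Collect the neighbours' new ids; the adjacency in the new id space
      // ends up in the value as "myNewId neighbour1 neighbour2 ...".
      StringBuilder out = new StringBuilder(vertex.getValue().toString());
      for (Text m : messages) {
        String[] urlAndId = m.toString().split("\\|");
        out.append(' ').append(urlAndId[1]); // keep only the new long id
      }
      vertex.setValue(new Text(out.toString()));
      vertex.voteToHalt();
    }
  }
}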

Note that you can do another thing here as well, MUCH SIMPLER than the above:

Superstep 0 (workers): in your compute class, keep a thread-local variable that increases for each vertex the thread computes, and assign the value [(workerId, threadId), number] to each vertex. Then aggregate {(workerId, threadId), count}.

Superstep 1 (master): we now have [{(workerId, threadId), count in group}], so compute another aggregator of the form [{(workerId, threadId), cumulative sum up to now}] and send it to the workers.

Superstep 1 (workers): read the cumulative sum from the aggregator and add it to each vertex's current number.

When you output the graph this time as edge output, source id and target id are set from the vertex values, i.e., the assigned numbers. A sketch of the worker side follows.
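
Roughly, the worker side could look like this (a sketch, not exact code: it reuses a map-valued aggregator like DomainCountAggregator above under the made-up names "slice.counts" and "slice.offsets", and it assumes getMyWorkerIndex() is available, as on recent Giraph versions):

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

/** Two-superstep numbering; the vertex value holds "sliceKey|localNumber". */
public class SliceNumbering
    extends BasicComputation<Text, Text, NullWritable, NullWritable> {
  // Per compute-thread counter; only consulted during superstep 0.
  private static final ThreadLocal<Long> LOCAL = ThreadLocal.withInitial(() -> 0L);

  @Override
  public void compute(Vertex<Text, Text, NullWritable> vertex,
      Iterable<NullWritable> messages) {
    if (getSuperstep() == 0) {
      // ASSUMPTION: getMyWorkerIndex() exists (Giraph 1.2-style API).
      String key = getMyWorkerIndex() + ":" + Thread.currentThread().getId();
      long n = LOCAL.get();
      LOCAL.set(n + 1);
      vertex.setValue(new Text(key + "|" + n));   // remember slice + position
      MapWritable one = new MapWritable();
      one.put(new Text(key), new LongWritable(1));
      aggregate("slice.counts", one);             // summed per slice key
    } else {
      String[] keyAndN = vertex.getValue().toString().split("\\|");
      MapWritable offsets = getAggregatedValue("slice.offsets");
      long base = ((LongWritable) offsets.get(new Text(keyAndN[0]))).get();
      vertex.setValue(new Text(Long.toString(base + Long.parseLong(keyAndN[1]))));
      vertex.voteToHalt();
    }
  }
}

The master side is the same cumulative-sum computation as in OffsetMasterCompute above, just keyed by "worker:thread" instead of domain.
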
Date: Tue, 15 Apr 2014 23:40:39 +0200
Subject: Re: Changing index of a graph
From: mneum...@spotify.com
To: user@giraph.apache.org

I have a pipeline that creates a graph and then does some transformations on it (with Giraph). In the end I want to dump it into Neo4j to allow for Cypher queries.

I was told that I could make the batch import into Neo4j a lot faster if I used Long identifiers without holes, matching their internal ID space. If I understand it right, they build an on-disk index using the IDs as offsets, which is why the ID space should have no holes.

I didn't expect it to be so costly to change the index, but I guess this way I 
could at least spread the load to the cluster, since batch import happens on a 
single machine.

Thanks for the input, I will see what makes the most sense with the limited time I have.

On Tue, Apr 15, 2014 at 5:31 PM, Lukas Nalezenec
<lukas.naleze...@firma.seznam.cz> wrote:

Hi,

I did the same thing in two M/R jobs during preprocessing - it was pretty powerful for web graphs but a little bit slow.

The solution for Giraph is:

1. Implement your own Partition that iterates vertices in order. Use an appropriate partitioner.

2. During the first iteration, rename the vertices in each partition without holes; holes will remain only between partitions. At the end, get the min and max vertex index for each partition, send them to the master in an aggregator, and compute the mapping needed to delete the holes (see the sketch below).

3. During the second iteration, iterate over all vertices and delete the holes by shifting the vertex indexes.

4. .... rename edges (two more iterations)...
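
The hole-closing arithmetic in steps 2 and 3 is just a running sum over the per-partition ranges; a tiny self-contained sketch (all names hypothetical):

import java.util.Map;
import java.util.TreeMap;

public class HoleShifts {
  /**
   * Given the per-partition [min, max] index ranges from the first
   * iteration, returns the shift each partition must subtract from its
   * vertex indexes so that the ranges become contiguous (no holes).
   */
  static Map<Integer, Long> shifts(TreeMap<Integer, long[]> minMaxByPartition) {
    Map<Integer, Long> shift = new TreeMap<>();
    long nextBase = 0;
    for (Map.Entry<Integer, long[]> e : minMaxByPartition.entrySet()) {
      long min = e.getValue()[0];
      long max = e.getValue()[1];
      shift.put(e.getKey(), min - nextBase); // newIndex = oldIndex - shift
      nextBase += max - min + 1;             // ranges stack back to back
    }
    return shift;
  }
}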

      

Btw: Why do you need such indexes? For HLL?

Lukas

On 15.4.2014 15:33, Martin Neumann wrote:

Hej,

I have a huge edgelist (several billion edges) where the node IDs are URLs. The algorithm I want to run needs the IDs to be longs, and there should be no holes in the ID space (so I can't simply hash the URLs).

Is anyone aware of a simple solution that does not require an impractically huge hash map?

My current idea is to load the graph into another Giraph job and assign a number to each node; this way the mapping of number to URL would be stored in the node. The problem is that I have to assign the numbers sequentially to ensure there are no holes and the numbers are unique. I have no idea whether this is even possible in Giraph.

Any input is welcome.

cheers Martin

  


                                          
