Thanks Colin and Owen, I will try some of the ideas here and report back.
Best,
Bhupesh

On 10/16/08 4:05 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:

> The trick is to amortize your computation over the whole set. So DFS
> for a single node will always be faster on an in-memory graph, but
> Hadoop is a good tool for computing all-pairs shortest paths in one
> shot if you re-frame the algorithm as a belief-propagation and
> message-passing algorithm.
>
> A lot of the time, the computation still explodes into n^2 or worse,
> so you need to use a binning or blocking algorithm, like the one
> described here: http://www.youtube.com/watch?v=1ZDybXl212Q
>
> In the case of graphs, a blocking function would be to find
> overlapping strongly connected subgraphs where each subgraph fits in a
> reasonable amount of memory. Then within each block you do your
> computation, and you pass a summary of that computation to adjacent
> blocks, which gets factored into the next computation.
>
> When we hooked up a Very Big Graph to our Hadoop cluster, we found
> that there were a lot of scaling problems, which went away when we
> started optimizing for streaming performance.
>
> -Colin
>
>
> Bhupesh Bansal wrote:
>> Can you elaborate here?
>>
>> Let's say I want to implement a DFS on my graph. I can't picture
>> implementing it with the graph in pieces without putting a depth
>> bound of (3-4) on it. Let's say we have 200M (4 GB) edges to start
>> with.
>>
>> Best,
>> Bhupesh
>>
>>
>> On 10/16/08 3:01 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote:
>>
>>> On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
>>>
>>>> We at LinkedIn are trying to run some large graph-analysis problems
>>>> on Hadoop. The fastest way to run would be to keep a copy of the
>>>> whole graph in RAM at all mappers. (Graph size is about 8 GB in
>>>> RAM.) We have a cluster of 8-core machines with 8 GB on each.
>>>>
>>> The best way to deal with it is *not* to load the entire graph in
>>> one process.
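Colin's re-framing above — shortest paths as rounds of message passing, where each round would be one MapReduce pass over the edge list — can be sketched in plain Java. This is an in-memory simulation of the rounds, not Hadoop code; the class and method names are illustrative only, and it uses unit edge weights (BFS distances) for simplicity:

```java
import java.util.*;

// Minimal sketch of message-passing shortest paths. Each round, every
// node with a known distance sends (distance + 1) to its neighbors
// (the "map" phase), and each node keeps the minimum message it
// received (the "reduce" phase). On Hadoop, each round would be one
// MapReduce job over the edge list; here the rounds run in memory.
public class ShortestPathSketch {
    public static Map<String, Integer> shortestPaths(
            Map<String, List<String>> adj, String source) {
        Map<String, Integer> dist = new HashMap<>();
        dist.put(source, 0);
        boolean changed = true;
        while (changed) {                    // one loop pass == one MR job
            changed = false;
            Map<String, Integer> messages = new HashMap<>();
            for (Map.Entry<String, Integer> e : dist.entrySet()) {
                for (String nbr : adj.getOrDefault(e.getKey(), List.of())) {
                    // "map": emit (neighbor, myDistance + 1), keeping
                    // only the smallest message per neighbor
                    messages.merge(nbr, e.getValue() + 1, Math::min);
                }
            }
            for (Map.Entry<String, Integer> m : messages.entrySet()) {
                // "reduce": adopt the message if it improves the distance
                Integer cur = dist.get(m.getKey());
                if (cur == null || m.getValue() < cur) {
                    dist.put(m.getKey(), m.getValue());
                    changed = true;
                }
            }
        }
        return dist;
    }
}
```

A driver program would apply the same termination test between jobs: stop when a round improves no distance. Running this from every source node in turn gives the all-pairs version, which is where the n^2 blow-up Colin mentions comes from.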
>>> In the WebMap at Yahoo, we have a graph of the web that has roughly
>>> 1 trillion links and 100 billion nodes. See
>>> http://tinyurl.com/4fgok6. To invert the links, you process the
>>> graph in pieces and re-sort based on the target. You'll get much
>>> better performance and scale to almost any size.
>>>
>>>> What's the best way of doing that? Is there a way so that multiple
>>>> mappers on the same machine can access a RAM cache? I read about
>>>> the Hadoop distributed cache; it looks like it copies the file
>>>> (HDFS/HTTP) locally onto the slaves, but not necessarily into RAM.
>>>>
>>> You could mmap the file from the distributed cache using a
>>> MappedByteBuffer. Then there will be one copy between JVMs...
>>>
>>> -- Owen
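Owen's mmap suggestion can be sketched as follows — a minimal example that assumes the graph file has already been localized to disk by the distributed cache. The class and method names are placeholders, not Hadoop API; note also that a single MappedByteBuffer is capped at 2 GB, so an 8 GB graph would have to be mapped as several regions:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of mmap-ing a locally cached file. A read-only mapping is
// backed by the OS page cache rather than the JVM heap, so every task
// JVM on the machine that maps the same file shares one physical copy
// of its pages.
public class GraphMapper {
    public static MappedByteBuffer mapRegion(Path localFile, long offset,
                                             long length) throws IOException {
        try (FileChannel ch = FileChannel.open(localFile,
                StandardOpenOption.READ)) {
            // Pages are faulted in on demand; the mapping stays valid
            // after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
        }
    }
}
```

Because the bytes live in the page cache, the "one copy between JVMs" Owen describes comes for free from the OS, and pages that no mapper touches are never paged in at all — which matters when the graph barely fits in the machine's 8 GB.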