Thanks Colin and Owen, I will try some of the ideas here and report back.
Best,
Bhupesh

On 10/16/08 4:05 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:

> The trick is to amortize your computation over the whole set. So DFS
> for a single node will always be faster on an in-memory graph, but
> Hadoop is a good tool for computing all-pairs shortest paths in one
> shot if you re-frame the algorithm as a belief-propagation and
> message-passing algorithm.
>
> A lot of the time, the computation still explodes into n^2 or worse,
> so you need to use a binning or blocking algorithm, like the one
> described here: http://www.youtube.com/watch?v=1ZDybXl212Q
>
> In the case of graphs, a blocking function would be to find
> overlapping strongly connected subgraphs where each subgraph fits in a
> reasonable amount of memory. Then within each block you do your
> computation, and you pass a summary of that computation to adjacent
> blocks, which gets factored into the next computation.
>
> When we hooked up a Very Big Graph to our Hadoop cluster, we found
> that there were a lot of scaling problems, which went away when we
> started optimizing for streaming performance.
>
> -Colin
>
>
> Bhupesh Bansal wrote:
>> Can you elaborate here?
>>
>> Let's say I want to implement a DFS on my graph. I can't picture
>> implementing it with the graph in pieces without putting a depth
>> bound of (3-4) on it. Let's say we have 200M (4 GB) edges to start
>> with.
>>
>> Best,
>> Bhupesh
>>
>>
>> On 10/16/08 3:01 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote:
>>
>>> On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
>>>
>>>> We at LinkedIn are trying to run some large graph-analysis problems
>>>> on Hadoop. The fastest way to run would be to keep a copy of the
>>>> whole graph in RAM at all mappers. (Graph size is about 8 GB in
>>>> RAM.) We have a cluster of 8-core machines with 8 GB on each.
>>>>
>>> The best way to deal with it is *not* to load the entire graph in
>>> one process.
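Colin's re-framing above — shortest paths as rounds of message passing, where each round would be one MapReduce pass over the edge list — can be sketched in plain Java. This is an in-memory simulation of the rounds, not Hadoop code; the class and method names are illustrative only, and it uses unit edge weights (BFS distances) for simplicity:

```java
import java.util.*;

// Minimal sketch of message-passing shortest paths. Each round, every
// node with a known distance sends (distance + 1) to its neighbors
// (the "map" phase), and each node keeps the minimum message it
// received (the "reduce" phase). On Hadoop, each round would be one
// MapReduce job over the edge list; here the rounds run in memory.
public class ShortestPathSketch {
    public static Map<String, Integer> shortestPaths(
            Map<String, List<String>> adj, String source) {
        Map<String, Integer> dist = new HashMap<>();
        dist.put(source, 0);
        boolean changed = true;
        while (changed) {                    // one loop pass == one MR job
            changed = false;
            Map<String, Integer> messages = new HashMap<>();
            for (Map.Entry<String, Integer> e : dist.entrySet()) {
                for (String nbr : adj.getOrDefault(e.getKey(), List.of())) {
                    // "map": emit (neighbor, myDistance + 1), keeping
                    // only the smallest message per neighbor
                    messages.merge(nbr, e.getValue() + 1, Math::min);
                }
            }
            for (Map.Entry<String, Integer> m : messages.entrySet()) {
                // "reduce": adopt the message if it improves the distance
                Integer cur = dist.get(m.getKey());
                if (cur == null || m.getValue() < cur) {
                    dist.put(m.getKey(), m.getValue());
                    changed = true;
                }
            }
        }
        return dist;
    }
}
```

A driver program would apply the same termination test between jobs: stop when a round improves no distance. Running this from every source node in turn gives the all-pairs version, which is where the n^2 blow-up Colin mentions comes from.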
>>> In the WebMap at Yahoo, we have a graph of the web that has roughly
>>> 1 trillion links and 100 billion nodes. See
>>> http://tinyurl.com/4fgok6. To invert the links, you process the
>>> graph in pieces and re-sort based on the target. You'll get much
>>> better performance and scale to almost any size.
>>>
>>>> What's the best way of doing that? Is there a way so that multiple
>>>> mappers on the same machine can access a RAM cache? I read about
>>>> the Hadoop distributed cache; it looks like it copies the file
>>>> (HDFS/HTTP) locally onto the slaves, but not necessarily into RAM.
>>>>
>>> You could mmap the file from the distributed cache using a
>>> MappedByteBuffer. Then there will be one copy between JVMs...
>>>
>>> -- Owen
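Owen's mmap suggestion can be sketched as follows — a minimal example that assumes the graph file has already been localized to disk by the distributed cache. The class and method names are placeholders, not Hadoop API; note also that a single MappedByteBuffer is capped at 2 GB, so an 8 GB graph would have to be mapped as several regions:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of mmap-ing a locally cached file. A read-only mapping is
// backed by the OS page cache rather than the JVM heap, so every task
// JVM on the machine that maps the same file shares one physical copy
// of its pages.
public class GraphMapper {
    public static MappedByteBuffer mapRegion(Path localFile, long offset,
                                             long length) throws IOException {
        try (FileChannel ch = FileChannel.open(localFile,
                StandardOpenOption.READ)) {
            // Pages are faulted in on demand; the mapping stays valid
            // after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
        }
    }
}
```

Because the bytes live in the page cache, the "one copy between JVMs" Owen describes comes for free from the OS, and pages that no mapper touches are never paged in at all — which matters when the graph barely fits in the machine's 8 GB.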