At Freebase, we're mapping our large graphs into very large files of triples in HDFS and running large queries over them. Hadoop is optimized for streaming data off of disk, and we've found that trying to load a multi-GB graph into memory and then access it from a Hadoop task doesn't scale. Mapping the graph to an on-disk representation as a set of interlocking or overlapping subgraphs works very well.
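
To make the streaming style concrete, here's a rough sketch of a mapper that processes triples one line at a time (classic org.apache.hadoop.mapred API; the tab-separated subject/predicate/object format and the class name are just illustrative assumptions, not our actual code):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TripleMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Each call sees exactly one triple; the full graph never has
      // to fit in the task's RAM.
      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] spo = line.toString().split("\t", 3);
        if (spo.length != 3) {
          return;  // skip malformed lines
        }
        // Key by subject so each reduce group is one node's outgoing
        // edges, i.e. one small subgraph.
        out.collect(new Text(spo[0]), new Text(spo[1] + "\t" + spo[2]));
      }
    }

Keying the output by subject is one simple way to carve the graph into the kind of overlapping subgraphs I mentioned, with each reducer seeing only the edges it needs.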

-Colin


Bhupesh Bansal wrote:
Hey guys,

We at LinkedIn are trying to run some large graph analysis problems on
Hadoop. The fastest way to run them would be to keep a copy of the whole
graph in RAM at all mappers (the graph is about 8 GB in RAM), but we have
a cluster of 8-core machines with 8 GB each.

What is the best way of doing that? Is there a way for multiple mappers on
the same machine to share a RAM cache? I read about Hadoop's
DistributedCache; it looks like it copies the file (from HDFS or HTTP)
locally onto the slaves, but not necessarily into RAM.
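
For reference, this is roughly the DistributedCache usage I was looking at (classic org.apache.hadoop.mapred API; the HDFS path is just a placeholder), and as far as I can tell it only stages the file on local disk:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class GraphCacheSetup {
      public static void configure(JobConf conf) throws Exception {
        // Copies the file from HDFS to every tasktracker's local
        // disk before tasks run; it does NOT load it into RAM or
        // share memory between mapper JVMs.
        DistributedCache.addCacheFile(
            new URI("/user/bhupesh/graph/graph-8g.bin"), conf);
        // Inside a mapper's configure(), the local copy shows up via:
        //   Path[] local = DistributedCache.getLocalCacheFiles(conf);
      }
    }

So each mapper JVM would still have to read the 8 GB file into its own heap, which is exactly what we can't afford on these machines.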

Best
Bhupesh
