Hello, I have a graph with over a million nodes, and each node may have thousands of edges. My graph is stored in HBase as:
<source, colon_sep_list_of_connected_nodes>

I have thousands of such rows in my HBase table. I am having trouble running standard algorithms such as PageRank and ConnectedComponents because of mapper timeouts. I can avoid the timeouts if I reduce the number of outgoing edges per node to a few hundred (by doing a partial analysis).

One solution could be to increase the Hadoop mapper timeouts or the HBase/ZooKeeper scanner timeouts, but I would like to know whether Giraph is intelligent enough to figure out the following:

1. In the Giraph vertex input format, we create the vertices and edges. What if I split my HBase rows into multiple rows, such that no row has more than X neighbours? So:

<source, colon_sep_list_of_connected_nodes_part1>
<source, colon_sep_list_of_connected_nodes_part2>
<source, colon_sep_list_of_connected_nodes_part3>
............................
<source, colon_sep_list_of_connected_nodes_partn>

This would create multiple mappers for each original row, but I am not sure whether Giraph will determine that multiple nodes with the same ID, each with a smaller number of edges, are actually the same vertex with millions of edges.

2. I am also wondering how I can create bidirectional edges in Giraph. Do I have to modify my input tables to contain two rows, one for a-->b and another for b-->a? Is it not possible to do this while keeping only one record in the table?

Thanks,
Puneet
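P.S. To make both questions concrete, here is a minimal sketch of the behaviour I am hoping for. It is plain Python, not Giraph code, and the function names (merge_split_rows, add_reverse_edges) and the sample rows are made up for illustration. It only shows the intended result: partial rows with the same source ID end up as one vertex, and every edge a-->b also gets a matching b-->a from a single stored record.

```python
from collections import defaultdict


def merge_split_rows(rows):
    """Combine partial rows that share a source ID into one adjacency set."""
    adjacency = defaultdict(set)
    for source, neighbours in rows:
        # Neighbours are colon-separated, as in the HBase rows above.
        adjacency[source].update(n for n in neighbours.split(":") if n)
    return dict(adjacency)


def add_reverse_edges(adjacency):
    """Return a new mapping in which every edge a->b also has b->a."""
    undirected = defaultdict(set)
    for source, targets in adjacency.items():
        undirected[source].update(targets)
        for target in targets:
            undirected[target].add(source)
    return dict(undirected)


# Hypothetical sample: vertex "a" is stored as two partial rows.
rows = [
    ("a", "b:c"),  # part 1 of vertex a
    ("a", "d"),    # part 2 of vertex a
    ("b", "c"),
]

merged = merge_split_rows(rows)
bidirectional = add_reverse_edges(merged)
```

In this sketch the two "a" rows collapse into a single vertex with three neighbours, and "c" and "d" gain edges back to "a" without having rows of their own. Whether Giraph's input formats do the equivalent merge on their own is exactly what I am asking.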