Re: Breadth-first search
We are running several Giraph applications in production using our version of Hadoop (Corona) at Facebook. The part you have to be careful about is ensuring you have enough resources for your job to run. But otherwise, we are able to run at FB scale (i.e. 1 billion+ nodes, many more edges).

Avery

On 12/11/12 5:58 AM, Gustavo Enrique Salazar Torres wrote:
> Hi:
> I implemented a graph algorithm to recommend content to our users. Although it is working (the implementation uses Mahout), it is very inefficient because I have to run many iterations in order to perform a breadth-first search on my graph. I would like to use Giraph for that task, and I would like to know if it is production ready. I'm running jobs on Amazon EMR.
> Thanks in advance.
> Gustavo
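Since each Giraph superstep corresponds to one BFS level, the whole search runs as a single job instead of one Mahout/MapReduce pass per level. Below is a minimal sketch of such a vertex program, written against the later Giraph 1.x BasicComputation API (the 0.1 release discussed in this thread uses different class names); the source id, the "unvisited" sentinel, and the class name are illustrative assumptions, not code from this thread:

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    /** BFS sketch: the vertex value holds the BFS depth from SOURCE_ID. */
    public class BfsComputation extends BasicComputation<
        LongWritable, LongWritable, NullWritable, LongWritable> {

      /** Illustrative source vertex; in practice read from the configuration. */
      private static final long SOURCE_ID = 0;
      /** Sentinel meaning "not reached yet". */
      private static final long UNVISITED = Long.MAX_VALUE;

      @Override
      public void compute(
          Vertex<LongWritable, LongWritable, NullWritable> vertex,
          Iterable<LongWritable> messages) {
        if (getSuperstep() == 0) {
          boolean isSource = vertex.getId().get() == SOURCE_ID;
          vertex.setValue(new LongWritable(isSource ? 0 : UNVISITED));
          if (isSource) {
            // Wake the source's neighbors for superstep 1.
            sendMessageToAllEdges(vertex, vertex.getValue());
          }
        } else if (vertex.getValue().get() == UNVISITED
            && messages.iterator().hasNext()) {
          // First message ever received: this vertex sits on the current
          // frontier, so its depth is exactly the superstep number.
          vertex.setValue(new LongWritable(getSuperstep()));
          sendMessageToAllEdges(vertex, vertex.getValue());
        }
        // Halt; an incoming message in a later superstep reactivates us.
        vertex.voteToHalt();
      }
    }

The job terminates once the frontier stops growing, i.e. after roughly as many supersteps as the graph's diameter, which is where the win over per-level MapReduce jobs comes from.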
Re: Breadth-first search
Hi Avery:

Regarding resources, I guess I won't need that much; our graph has only 60,000 nodes. I believe one c1.xlarge EC2 machine can handle this, or I can scale if needed.

Thank you very much.
Gustavo

On Tue, Dec 11, 2012 at 4:40 PM, Avery Ching ach...@apache.org wrote:
> [quoted message snipped; see above]
Problems running PageRank benchmark
Hi:

I checked out release 0.1 and, after compiling it, I tried to run this line:

  hadoop jar giraph-0.1-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 5000 -w 2

but the job took too long to finish:

  Using org.apache.giraph.benchmark.PageRankBenchmark$PageRankHashMapVertex
  12/12/11 18:15:49 WARN bsp.BspOutputFormat: checkOutputSpecs: ImmutableOutputCommiter will not check anything
  12/12/11 18:15:49 INFO mapred.JobClient: Running job: job_201212111806_0003
  12/12/11 18:15:50 INFO mapred.JobClient:  map 0% reduce 0%
  12/12/11 18:16:06 INFO mapred.JobClient:  map 33% reduce 0%
  12/12/11 18:21:11 INFO mapred.JobClient: Job complete: job_201212111806_0003
  12/12/11 18:21:11 INFO mapred.JobClient: Counters: 5
  12/12/11 18:21:11 INFO mapred.JobClient:   Job Counters
  12/12/11 18:21:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=619750
  12/12/11 18:21:11 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
  12/12/11 18:21:11 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
  12/12/11 18:21:11 INFO mapred.JobClient:     Launched map tasks=2
  12/12/11 18:21:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=4181

I checked the logs for one of the map tasks and found this:

  2012-12-11 18:21:03,929 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: No response from partition 2 (could be master)
  2012-12-11 18:21:03,929 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 1 of 2 required). This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
  2012-12-11 18:21:03,932 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {_stateKey:FAILED,_applicationAttemptKey:-1,_superstepKey:-1} on superstep -1
  2012-12-11 18:21:04,018 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201212111806_0003
  2012-12-11 18:21:04,142 INFO org.apache.giraph.graph.BspServiceMaster: cleanup: Notifying master its okay to cleanup with /_hadoopBsp/job_201212111806_0003/_cleanedUpDir/0_master
  2012-12-11 18:21:04,159 INFO org.apache.giraph.graph.BspServiceMaster: cleanUpZooKeeper: Node /_hadoopBsp/job_201212111806_0003/_cleanedUpDir already exists, no need to create.
  2012-12-11 18:21:04,161 INFO org.apache.giraph.graph.BspServiceMaster: cleanUpZooKeeper: Got 1 of 3 desired children from /_hadoopBsp/job_201212111806_0003/_cleanedUpDir
  2012-12-11 18:21:04,161 INFO org.apache.giraph.graph.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/job_201212111806_0003/_cleanedUpDir to change since only got 1 nodes.
  2012-12-11 18:21:05,013 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

My Hadoop version is 1.0.3. Is there any special configuration to be done? Can anybody help me?

Thanks
Gustavo
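The ERROR line points at the likely cause: with -w 2 the benchmark needs map slots for two workers plus a master at the same time, while a stock single-node Hadoop 1.0.3 TaskTracker offers only two map slots (the default for mapred.tasktracker.map.tasks.maximum is 2), which matches the "Launched map tasks=2" counter above. Assuming a single-node setup, one fix is to raise the slot count in mapred-site.xml and restart the TaskTracker:

  <!-- mapred-site.xml (assumed single-node cluster): allow enough
       concurrent map tasks for the Giraph master plus all workers. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>

Alternatively, rerunning the same command with -w 1 keeps the master plus one worker within the default two slots.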
Re: Breadth-first search
Hi Gustavo,

If your graph fits in memory, you might be interested in Green-Marl, a language tailored for graph processing: https://github.com/stanford-ppl/Green-Marl

You can compile your Green-Marl program to an extremely fast C++ program, but also to a Giraph program when your graph no longer fits in memory.

- Jan

On Tue, Dec 11, 2012 at 8:33 PM, Gustavo Enrique Salazar Torres gsala...@ime.usp.br wrote:
> [quoted message snipped; see above]